Skip to main content

2025-05-07-14-36

Holmes: Automated Fact Check with Large Language Models

Abstract

arXiv:2505.03135v1 Announce Type: new Abstract: The rise of Internet connectivity has accelerated the spread of disinformation, threatening societal trust, decision-making, and national security. Disinformation has evolved from simple text to complex multimodal forms combining images and text, challenging existing detection methods. Traditional deep learning models struggle to capture the complexity of multimodal disinformation. Inspired by advances in AI, this study explores using Large Language Models (LLMs) for automated disinformation detection. The empirical study shows that (1) LLMs alone cannot reliably assess the truthfulness of claims; (2) providing relevant evidence significantly improves their performance; (3) however, LLMs cannot autonomously search for accurate evidence. To address this, we propose Holmes, an end-to-end framework featuring a novel evidence retrieval method that assists LLMs in collecting high-quality evidence. Our approach uses (1) LLM-powered summarization to extract key information from open sources and (2) a new algorithm and metrics to evaluate evidence quality. Holmes enables LLMs to verify claims and generate justifications effectively. Experiments show Holmes achieves 88.3% accuracy on two open-source datasets and 90.2% in real-time verification tasks. Notably, our improved evidence retrieval boosts fact-checking accuracy by 30.8% over existing methods

摘要

互联网普及率的提升加速了虚假信息的传播,威胁社会信任、决策制定和国家安全。虚假信息已从单一文本形式演变为图文结合的多模态复杂形态,这对现有检测方法提出了挑战。传统深度学习模型难以捕捉多模态虚假信息的复杂性。受人工智能进展启发,本研究探索利用大语言模型(LLMs)实现自动化虚假信息检测。实证研究表明:(1)单独使用LLMs无法可靠评估声明真实性;(2)提供相关证据能显著提升其性能;(3)但LLMs无法自主搜索准确证据。为此,我们提出Holmes端到端框架,其创新性证据检索方法可协助LLMs收集高质量证据。该方案采用:(1)基于LLM的摘要技术从开放源提取关键信息;(2)新型算法与指标评估证据质量。Holmes使LLMs能有效验证声明并生成论证依据。实验表明,Holmes在两个开源数据集上达到88.3%准确率,实时验证任务中达90.2%。值得注意的是,我们改进的证据检索方法将事实核查准确率较现有技术提升30.8%。


BLAB: Brutally Long Audio Bench

Abstract

arXiv:2505.03054v1 Announce Type: new Abstract: Developing large audio language models (LMs) capable of understanding diverse spoken interactions is essential for accommodating the multimodal nature of human communication and can increase the accessibility of language technologies across different user populations. Recent work on audio LMs has primarily evaluated their performance on short audio segments, typically under 30 seconds, with limited exploration of long-form conversational speech segments that more closely reflect natural user interactions with these models. We introduce Brutally Long Audio Bench (BLAB), a challenging long-form audio benchmark that evaluates audio LMs on localization, duration estimation, emotion, and counting tasks using audio segments averaging 51 minutes in length. BLAB consists of 833+ hours of diverse, full-length audio clips, each paired with human-annotated, text-based natural language questions and answers. Our audio data were collected from permissively licensed sources and underwent a human-assisted filtering process to ensure task compliance. We evaluate six open-source and proprietary audio LMs on BLAB and find that all of them, including advanced models such as Gemini 2.0 Pro and GPT-4o, struggle with the tasks in BLAB. Our comprehensive analysis reveals key insights into the trade-offs between task difficulty and audio duration. In general, we find that audio LMs struggle with long-form speech, with performance declining as duration increases. They perform poorly on localization, temporal reasoning, counting, and struggle to understand non-phonemic information, relying more on prompts than audio content. BLAB serves as a challenging evaluation framework to develop audio LMs with robust long-form audio understanding capabilities.

摘要

开发能够理解多样化语音交互的大型音频语言模型(LM)对于适应人类沟通的多模态特性至关重要,并能提升语言技术在不同用户群体中的可及性。当前音频语言模型的研究主要针对短音频片段(通常低于30秒)进行评估,而对更贴近用户自然交互的长篇对话语音段落的探索仍显不足。我们提出"超长音频基准测试"(BLAB),这是一个具有挑战性的长篇音频评估体系,通过平均时长达51分钟的音频片段,对音频语言模型在定位、时长估计、情感识别和计数等任务中的表现进行测试。BLAB包含833小时以上的多样化完整音频片段,每个片段均配有人工标注的文本式自然语言问答对。所有音频数据均来自允许商业使用的授权资源,并经过人工辅助筛选以确保任务合规性。我们对六款开源和商业音频语言模型在BLAB上的测试表明,包括Gemini 2.0 Pro和GPT-4o在内的先进模型均难以胜任这些任务。综合分析揭示了任务难度与音频时长之间的关键权衡关系:总体而言,现有音频语言模型对长篇语音理解能力有限,且表现随时长增加而下降;它们在定位、时序推理、计数等任务中表现欠佳,难以捕捉非音位信息,更依赖提示词而非音频内容本身。BLAB为开发具有强大长篇音频理解能力的语言模型提供了具有挑战性的评估框架。


Iterative Resolution of Prompt Ambiguities Using a Progressive Cutting-Search Approach

Abstract

arXiv:2505.02952v1 Announce Type: new Abstract: Generative AI systems have revolutionized human interaction by enabling natural language-based coding and problem solving. However, the inherent ambiguity of natural language often leads to imprecise instructions, forcing users to iteratively test, correct, and resubmit their prompts. We propose an iterative approach that systematically narrows down these ambiguities through a structured series of clarification questions and alternative solution proposals, illustrated with input/output examples as well. Once every uncertainty is resolved, a final, precise solution is generated. Evaluated on a diverse dataset spanning coding, data analysis, and creative writing, our method demonstrates superior accuracy, competitive resolution times, and higher user satisfaction compared to conventional one-shot solutions, which typically require multiple manual iterations to achieve a correct output.

摘要

生成式人工智能系统通过支持基于自然语言的编程和问题解决,彻底改变了人机交互方式。然而,自然语言固有的模糊性常导致指令不精确,迫使用户反复测试、修正并重新提交提示。我们提出一种迭代方法,通过结构化系列澄清问题和替代解决方案提案(辅以输入/输出示例)系统性地消除这些歧义。待所有不确定性均被解决后,系统将生成最终精确解。在涵盖编程、数据分析和创意写作的多样化数据集上评估表明,相较于传统单次生成方案(通常需多次人工迭代才能获得正确输出),本方法在准确性、具有竞争力的解决时效及用户满意度方面均展现出显著优势。


Assessing and Enhancing the Robustness of LLM-based Multi-Agent Systems Through Chaos Engineering

Abstract

arXiv:2505.03096v1 Announce Type: new Abstract: This study explores the application of chaos engineering to enhance the robustness of Large Language Model-Based Multi-Agent Systems (LLM-MAS) in production-like environments under real-world conditions. LLM-MAS can potentially improve a wide range of tasks, from answering questions and generating content to automating customer support and improving decision-making processes. However, LLM-MAS in production or preproduction environments can be vulnerable to emergent errors or disruptions, such as hallucinations, agent failures, and agent communication failures. This study proposes a chaos engineering framework to proactively identify such vulnerabilities in LLM-MAS, assess and build resilience against them, and ensure reliable performance in critical applications.

摘要

本研究探讨了在现实条件下,将混沌工程应用于增强基于大语言模型的多智能体系统(LLM-MAS)在类生产环境中的鲁棒性。LLM-MAS具有提升广泛任务处理能力的潜力,包括问答、内容生成、客户支持自动化以及决策流程优化。然而,处于生产或预生产环境的LLM-MAS容易受到突发性错误或中断的影响,例如幻觉现象、智能体故障及智能体间通信故障。本研究提出一个混沌工程框架,旨在主动识别LLM-MAS中的此类脆弱性,评估并构建其抵御能力,从而确保关键应用中的可靠性能。


CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics

Abstract

arXiv:2505.03171v1 Announce Type: new Abstract: Neurosymbolic approaches integrating large language models with formal reasoning have recently achieved human-level performance on mathematics competition problems in algebra, geometry and number theory. In comparison, combinatorics remains a challenging domain, characterized by a lack of appropriate benchmarks and theorem libraries. To address this gap, we introduce CombiBench, a comprehensive benchmark comprising 100 combinatorial problems, each formalized in Lean~4 and paired with its corresponding informal statement. The problem set covers a wide spectrum of difficulty levels, ranging from middle school to IMO and university level, and span over ten combinatorial topics. CombiBench is suitable for testing IMO solving capabilities since it includes all IMO combinatorial problems since 2000 (except IMO 2004 P3 as its statement contain an images). Furthermore, we provide a comprehensive and standardized evaluation framework, dubbed Fine-Eval (for \textbf{F}ill-in-the-blank \textbf{in} L\textbf{e}an Evaluation), for formal mathematics. It accommodates not only proof-based problems but also, for the first time, the evaluation of fill-in-the-blank questions. Using Fine-Eval as the evaluation method and Kimina Lean Server as the backend, we benchmark several LLMs on CombiBench and observe that their capabilities for formally solving combinatorial problems remain limited. Among all models tested (none of which has been trained for this particular task), Kimina-Prover attains the best results, solving 7 problems (out of 100) under both with solution'' and without solution'' scenarios. We open source the benchmark dataset alongside with the code of the proposed evaluation method at https://github.com/MoonshotAI/CombiBench/.

摘要

将大语言模型与形式化推理相结合的神经符号方法近期在代数、几何和数论领域的数学竞赛题上已达到人类水平。相比之下,组合数学仍是一个具有挑战性的领域,其特点是缺乏合适的基准测试和定理库。为填补这一空白,我们推出了CombiBench——一个包含100道组合问题的综合性基准测试集,每道问题均用Lean~4形式化并配有对应的非形式化描述。该问题集涵盖从中学到国际数学奥林匹克(IMO)及大学水平的广泛难度范围,涉及十余个组合数学主题。由于包含2000年以来所有IMO组合题(除含图像的IMO 2004 P3外),CombiBench适用于测试IMO解题能力。此外,我们首次为形式化数学提供了名为Fine-Eval(基于Lean的填空式评估)的标准化评估框架,该框架不仅支持基于证明的问题,还能评估填空题。以Fine-Eval为评估方法、Kimina Lean Server为后端,我们对多个大语言模型在CombiBench上进行基准测试,发现它们形式化解决组合问题的能力仍有限。在所有测试模型(均未针对此任务专门训练)中,Kimina-Prover表现最佳,在"提供解答"和"无解答"两种场景下均解决了100题中的7题。我们在https://github.com/MoonshotAI/CombiBench/开源了基准数据集及评估方法代码。


Evaluating the Impact of AI-Powered Audiovisual Personalization on Learner Emotion, Focus, and Learning Outcomes

Abstract

arXiv:2505.03033v1 Announce Type: new Abstract: Independent learners often struggle with sustaining focus and emotional regulation in unstructured or distracting settings. Although some rely on ambient aids such as music, ASMR, or visual backgrounds to support concentration, these tools are rarely integrated into cohesive, learner-centered systems. Moreover, existing educational technologies focus primarily on content adaptation and feedback, overlooking the emotional and sensory context in which learning takes place. Large language models have demonstrated powerful multimodal capabilities including the ability to generate and adapt text, audio, and visual content. Educational research has yet to fully explore their potential in creating personalized audiovisual learning environments. To address this gap, we introduce an AI-powered system that uses LLMs to generate personalized multisensory study environments. Users select or generate customized visual themes (e.g., abstract vs. realistic, static vs. animated) and auditory elements (e.g., white noise, ambient ASMR, familiar vs. novel sounds) to create immersive settings aimed at reducing distraction and enhancing emotional stability. Our primary research question investigates how combinations of personalized audiovisual elements affect learner cognitive load and engagement. Using a mixed-methods design that incorporates biometric measures and performance outcomes, this study evaluates the effectiveness of LLM-driven sensory personalization. The findings aim to advance emotionally responsive educational technologies and extend the application of multimodal LLMs into the sensory dimension of self-directed learning.

摘要

独立学习者在非结构化或易分心的环境中常常难以保持专注和情绪调节。尽管部分学习者会借助音乐、自主感觉经络反应(ASMR)或视觉背景等环境辅助工具来维持注意力,但这些工具很少被整合为以学习者为中心的有机系统。现有教育技术主要关注内容适配与反馈,却忽视了学习发生时的情感与感官情境。大型语言模型已展现出强大的多模态能力,包括生成及适配文本、音频和视觉内容的功能。教育研究尚未充分探索其在构建个性化视听学习环境方面的潜力。为此,我们提出一种基于人工智能的系统,利用大型语言模型生成个性化多感官学习环境。用户可选择或生成定制化视觉主题(如抽象vs写实、静态vs动态)与听觉元素(如白噪音、环境ASMR、熟悉vs新颖的声音),以创建旨在减少干扰并增强情绪稳定性的沉浸式场景。本研究核心问题在于探究个性化视听元素组合如何影响学习者的认知负荷与参与度。通过结合生物特征测量与绩效结果的混合研究方法,评估语言模型驱动的感官个性化效果。研究成果有望推动情感响应式教育技术的发展,并将多模态语言模型的应用拓展至自主学习的感官维度。


Patterns and Mechanisms of Contrastive Activation Engineering

Abstract

arXiv:2505.03189v1 Announce Type: new Abstract: Controlling the behavior of Large Language Models (LLMs) remains a significant challenge due to their inherent complexity and opacity. While techniques like fine-tuning can modify model behavior, they typically require extensive computational resources. Recent work has introduced a class of contrastive activation engineering (CAE) techniques as promising approaches for steering LLM outputs through targeted modifications to their internal representations. Applied at inference-time with zero cost, CAE has the potential to introduce a new paradigm of flexible, task-specific LLM behavior tuning. We analyze the performance of CAE in in-distribution, out-of-distribution settings, evaluate drawbacks, and begin to develop comprehensive guidelines for its effective deployment. We find that 1. CAE is only reliably effective when applied to in-distribution contexts. 2. Increasing the number of samples used to generate steering vectors has diminishing returns at around 80 samples. 3. Steering vectors are susceptible to adversarial inputs that reverses the behavior that is steered for. 4. Steering vectors harm the overall model perplexity. 5. Larger models are more resistant to steering-induced degradation.

摘要

控制大型语言模型(LLM)的行为因其固有的复杂性和不透明性仍面临重大挑战。尽管微调等技术可以改变模型行为,但通常需要大量计算资源。近期研究提出了一类对比激活工程(CAE)技术,通过针对性修改模型内部表征来引导LLM输出,展现出良好前景。CAE在推理阶段零成本应用,有望开创一种灵活、面向特定任务的LLM行为调控新范式。我们系统分析了CAE在分布内和分布外场景下的性能表现,评估其局限性,并初步制定有效部署的全面指南。研究发现:1. CAE仅在分布内语境中能稳定生效;2. 用于生成引导向量的样本量增至约80个时出现收益递减;3. 引导向量易受对抗性输入影响,导致预期行为发生逆转;4. 引导向量会损害模型整体困惑度;5. 模型规模越大,对引导引发性能下降的抵抗性越强。


RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection via Retrieval-Augmented Generation

Abstract

arXiv:2505.03275v1 Announce Type: new Abstract: Large language models (LLMs) struggle to effectively utilize a growing number of external tools, such as those defined by the Model Context Protocol (MCP)\cite{IntroducingMCP}, due to prompt bloat and selection complexity. We introduce RAG-MCP, a Retrieval-Augmented Generation framework that overcomes this challenge by offloading tool discovery. RAG-MCP uses semantic retrieval to identify the most relevant MCP(s) for a given query from an external index before engaging the LLM. Only the selected tool descriptions are passed to the model, drastically reducing prompt size and simplifying decision-making. Experiments, including an MCP stress test, demonstrate RAG-MCP significantly cuts prompt tokens (e.g., by over 50%) and more than triples tool selection accuracy (43.13% vs 13.62% baseline) on benchmark tasks. RAG-MCP enables scalable and accurate tool integration for LLMs.

摘要

由于提示膨胀和选择复杂性,大语言模型(LLMs)难以有效利用日益增多的外部工具(如模型上下文协议MCP\cite{IntroducingMCP}定义的工具)。我们提出RAG-MCP框架,该检索增强生成框架通过卸载工具发现任务来解决这一挑战。RAG-MCP在调用LLM前,先通过语义检索从外部索引中识别与查询最相关的MCP工具,仅将选定工具的描述传递给模型,从而显著减少提示长度并简化决策过程。实验(包括MCP压力测试)表明,在基准任务中RAG-MCP能大幅削减提示标记(例如减少超50%),并将工具选择准确率提升至三倍以上(43.13% vs 基线13.62%)。该框架为LLMs实现了可扩展且精准的工具集成能力。


Capability-Driven Skill Generation with LLMs: A RAG-Based Approach for Reusing Existing Libraries and Interfaces

Abstract

arXiv:2505.03295v1 Announce Type: new Abstract: Modern automation systems increasingly rely on modular architectures, with capabilities and skills as one solution approach. Capabilities define the functions of resources in a machine-readable form and skills provide the concrete implementations that realize those capabilities. However, the development of a skill implementation conforming to a corresponding capability remains a time-consuming and challenging task. In this paper, we present a method that treats capabilities as contracts for skill implementations and leverages large language models to generate executable code based on natural language user input. A key feature of our approach is the integration of existing software libraries and interface technologies, enabling the generation of skill implementations across different target languages. We introduce a framework that allows users to incorporate their own libraries and resource interfaces into the code generation process through a retrieval-augmented generation architecture. The proposed method is evaluated using an autonomous mobile robot controlled via Python and ROS 2, demonstrating the feasibility and flexibility of the approach.

摘要

现代自动化系统日益依赖模块化架构,其中能力与技能作为一种解决方案被广泛采用。能力以机器可读的形式定义资源功能,而技能则提供实现这些能力的具体实施方案。然而,开发符合特定能力要求的技能实现仍是一项耗时且具有挑战性的任务。本文提出一种将能力视为技能实现契约的方法,利用大语言模型基于自然语言用户输入生成可执行代码。该方法的突出特点是集成现有软件库与接口技术,支持跨不同目标语言的技能实现生成。我们构建了一个框架,允许用户通过检索增强生成架构,将自有程序库和资源接口纳入代码生成流程。通过基于Python和ROS 2控制的自主移动机器人进行实验评估,验证了所提方法的可行性和灵活性。


Artificial Behavior Intelligence: Technology, Challenges, and Future Directions

Abstract

arXiv:2505.03315v1 Announce Type: new Abstract: Understanding and predicting human behavior has emerged as a core capability in various AI application domains such as autonomous driving, smart healthcare, surveillance systems, and social robotics. This paper defines the technical framework of Artificial Behavior Intelligence (ABI), which comprehensively analyzes and interprets human posture, facial expressions, emotions, behavioral sequences, and contextual cues. It details the essential components of ABI, including pose estimation, face and emotion recognition, sequential behavior analysis, and context-aware modeling. Furthermore, we highlight the transformative potential of recent advances in large-scale pretrained models, such as large language models (LLMs), vision foundation models, and multimodal integration models, in significantly improving the accuracy and interpretability of behavior recognition. Our research team has a strong interest in the ABI domain and is actively conducting research, particularly focusing on the development of intelligent lightweight models capable of efficiently inferring complex human behaviors. This paper identifies several technical challenges that must be addressed to deploy ABI in real-world applications including learning behavioral intelligence from limited data, quantifying uncertainty in complex behavior prediction, and optimizing model structures for low-power, real-time inference. To tackle these challenges, our team is exploring various optimization strategies including lightweight transformers, graph-based recognition architectures, energy-aware loss functions, and multimodal knowledge distillation, while validating their applicability in real-time environments.

摘要

理解和预测人类行为已成为自动驾驶、智能医疗、监控系统和社交机器人等多种人工智能应用领域的核心能力。本文界定了行为人工智能(ABI)的技术框架,该框架综合分析并解读人体姿态、面部表情、情感状态、行为序列及情境线索。系统阐述了ABI的关键技术组件,包括姿态估计、面部与情绪识别、序列行为分析以及情境感知建模。特别强调了大模型技术(如大语言模型、视觉基础模型和多模态融合模型)的最新进展对显著提升行为识别准确性与可解释性的变革性潜力。本团队在ABI领域持续开展深入研究,重点开发能够高效推断复杂人类行为的智能轻量化模型。同时指出实际应用部署中亟待解决的技术挑战,包括有限数据下的行为智能学习、复杂行为预测的不确定性量化,以及面向低功耗实时推理的模型结构优化。针对这些挑战,团队正在探索轻量化Transformer、基于图结构的识别架构、能量感知损失函数和多模态知识蒸馏等优化策略,并验证其在实时环境中的适用性。


AI-Driven Scholarly Peer Review via Persistent Workflow Prompting, Meta-Prompting, and Meta-Reasoning

Abstract

arXiv:2505.03332v1 Announce Type: new Abstract: Critical peer review of scientific manuscripts presents a significant challenge for Large Language Models (LLMs), partly due to data limitations and the complexity of expert reasoning. This report introduces Persistent Workflow Prompting (PWP), a potentially broadly applicable prompt engineering methodology designed to bridge this gap using standard LLM chat interfaces (zero-code, no APIs). We present a proof-of-concept PWP prompt for the critical analysis of experimental chemistry manuscripts, featuring a hierarchical, modular architecture (structured via Markdown) that defines detailed analysis workflows. We develop this PWP prompt through iterative application of meta-prompting techniques and meta-reasoning aimed at systematically codifying expert review workflows, including tacit knowledge. Submitted once at the start of a session, this PWP prompt equips the LLM with persistent workflows triggered by subsequent queries, guiding modern reasoning LLMs through systematic, multimodal evaluations. Demonstrations show the PWP-guided LLM identifying major methodological flaws in a test case while mitigating LLM input bias and performing complex tasks, including distinguishing claims from evidence, integrating text/photo/figure analysis to infer parameters, executing quantitative feasibility checks, comparing estimates against claims, and assessing a priori plausibility. To ensure transparency and facilitate replication, we provide full prompts, detailed demonstration analyses, and logs of interactive chats as supplementary resources. Beyond the specific application, this work offers insights into the meta-development process itself, highlighting the potential of PWP, informed by detailed workflow formalization, to enable sophisticated analysis using readily available LLMs for complex scientific tasks.

摘要

科学手稿的批判性同行评审对大型语言模型(LLMs)构成重大挑战,部分源于数据限制和专家推理的复杂性。本报告提出持续工作流提示法(PWP),这是一种可能具有广泛适用性的提示工程方法,旨在通过标准LLM聊天界面(零代码、无API)来弥合这一差距。我们展示了一个用于实验化学手稿批判性分析的概念验证PWP提示,其采用分层模块化架构(通过Markdown结构化),可定义详细的分析工作流。该PWP提示通过迭代应用元提示技术和元推理开发而成,旨在系统化编码专家评审工作流(包括隐性知识)。在会话开始时提交一次该PWP提示,即可为LLM配备由后续查询触发的持续工作流,引导现代推理型LLMs完成系统的多模态评估。演示表明,PWP引导的LLM在测试案例中能识别主要方法缺陷,同时缓解LLM输入偏差,并执行复杂任务,包括区分主张与证据、整合文本/照片/图表分析以推断参数、执行定量可行性检查、将估算值与主张对比,以及评估先验合理性。为确保透明度并便于复现,我们提供了完整提示、详细演示分析记录和交互式聊天日志作为补充资源。除具体应用外,本研究还揭示了元开发过程本身的洞见,凸显了PWP(基于详细工作流形式化)的潜力,即利用现成LLMs实现复杂科学任务的精密分析。


Validating the Effectiveness of a Large Language Model-based Approach for Identifying Children's Development across Various Free Play Settings in Kindergarten

Abstract

arXiv:2505.03369v1 Announce Type: new Abstract: Free play is a fundamental aspect of early childhood education, supporting children's cognitive, social, emotional, and motor development. However, assessing children's development during free play poses significant challenges due to the unstructured and spontaneous nature of the activity. Traditional assessment methods often rely on direct observations by teachers, parents, or researchers, which may fail to capture comprehensive insights from free play and provide timely feedback to educators. This study proposes an innovative approach combining Large Language Models (LLMs) with learning analytics to analyze children's self-narratives of their play experiences. The LLM identifies developmental abilities, while performance scores across different play settings are calculated using learning analytics techniques. We collected 2,224 play narratives from 29 children in a kindergarten, covering four distinct play areas over one semester. According to the evaluation results from eight professionals, the LLM-based approach achieved high accuracy in identifying cognitive, motor, and social abilities, with accuracy exceeding 90% in most domains. Moreover, significant differences in developmental outcomes were observed across play settings, highlighting each area's unique contributions to specific abilities. These findings confirm that the proposed approach is effective in identifying children's development across various free play settings. This study demonstrates the potential of integrating LLMs and learning analytics to provide child-centered insights into developmental trajectories, offering educators valuable data to support personalized learning and enhance early childhood education practices.

摘要

自由游戏是幼儿教育的重要组成部分,对儿童的认知、社交、情感和运动发展具有促进作用。然而,由于该活动具有非结构化和自发性的特点,在自由游戏中评估儿童发展面临重大挑战。传统评估方法通常依赖教师、家长或研究者的直接观察,这种方法可能无法全面捕捉自由游戏的深层价值,难以为教育者提供及时反馈。本研究提出一种创新方法,将大语言模型(LLMs)与学习分析技术相结合,通过分析儿童对游戏体验的自我叙述来评估发展水平。大语言模型负责识别发展能力,而不同游戏场景下的表现评分则通过学习分析技术计算。我们在某幼儿园一个学期内收集了29名儿童在四个不同游戏区域的2,224条游戏叙事。根据八位专业人员的评估结果,基于大语言模型的方法在识别认知、运动和社交能力方面具有较高准确率,多数领域准确率超过90%。此外,不同游戏场景下的发展成果存在显著差异,凸显了各区域对特定能力的独特促进作用。这些发现证实,所提出的方法能有效识别儿童在各种自由游戏场景中的发展状况。本研究展示了结合大语言模型与学习分析技术的潜力,可为发展轨迹提供以儿童为中心的洞察,为教育者支持个性化学习和改进幼儿教育实践提供有价值的数据支持。


Procedural Memory Is Not All You Need: Bridging Cognitive Gaps in LLM-Based Agents

Abstract

arXiv:2505.03434v1 Announce Type: new Abstract: Large Language Models (LLMs) represent a landmark achievement in Artificial Intelligence (AI), demonstrating unprecedented proficiency in procedural tasks such as text generation, code completion, and conversational coherence. These capabilities stem from their architecture, which mirrors human procedural memory -- the brain's ability to automate repetitive, pattern-driven tasks through practice. However, as LLMs are increasingly deployed in real-world applications, it becomes impossible to ignore their limitations operating in complex, unpredictable environments. This paper argues that LLMs, while transformative, are fundamentally constrained by their reliance on procedural memory. To create agents capable of navigating ``wicked'' learning environments -- where rules shift, feedback is ambiguous, and novelty is the norm -- we must augment LLMs with semantic memory and associative learning systems. By adopting a modular architecture that decouples these cognitive functions, we can bridge the gap between narrow procedural expertise and the adaptive intelligence required for real-world problem-solving.

摘要

大语言模型(LLMs)是人工智能(AI)领域的里程碑式成就,在文本生成、代码补全和会话连贯性等程序性任务中展现出前所未有的熟练度。这些能力源于其架构设计——它模拟了人类程序性记忆(即大脑通过练习自动化重复性、模式驱动任务的机制)。然而,随着LLMs在现实应用中的广泛部署,其在复杂不可预测环境中的运行局限性日益凸显。本文指出:尽管具有变革性,LLMs本质上受限于对程序性记忆的依赖。要构建能够驾驭"恶性"学习环境(规则多变、反馈模糊、新颖性成为常态)的智能体,必须通过语义记忆和联想学习系统增强LLMs。采用解耦这些认知功能的模块化架构,方能弥合狭窄的程序性专业能力与现实问题求解所需的适应性智能之间的鸿沟。


The Steganographic Potentials of Language Models

Abstract

arXiv:2505.03439v1 Announce Type: new Abstract: The potential for large language models (LLMs) to hide messages within plain text (steganography) poses a challenge to detection and thwarting of unaligned AI agents, and undermines faithfulness of LLMs reasoning. We explore the steganographic capabilities of LLMs fine-tuned via reinforcement learning (RL) to: (1) develop covert encoding schemes, (2) engage in steganography when prompted, and (3) utilize steganography in realistic scenarios where hidden reasoning is likely, but not prompted. In these scenarios, we detect the intention of LLMs to hide their reasoning as well as their steganography performance. Our findings in the fine-tuning experiments as well as in behavioral non fine-tuning evaluations reveal that while current models exhibit rudimentary steganographic abilities in terms of security and capacity, explicit algorithmic guidance markedly enhances their capacity for information concealment.

摘要

大语言模型(LLMs)在纯文本中隐藏信息的潜力(隐写术)对检测和阻止未对齐AI代理提出了挑战,并削弱了LLMs推理的可信度。我们通过强化学习(RL)微调的LLMs探索其隐写能力,旨在:(1)开发隐蔽编码方案;(2)在提示时进行隐写操作;(3)在未明确提示但可能隐藏推理的现实场景中应用隐写术。在这些场景中,我们检测了LLMs隐藏推理的意图及其隐写性能。微调实验和行为非微调评估的结果表明,当前模型在安全性和容量方面仅表现出初级的隐写能力,而显式的算法指导能显著提升其信息隐藏能力。


am-ELO: A Stable Framework for Arena-based LLM Evaluation

Abstract

arXiv:2505.03475v1 Announce Type: new Abstract: Arena-based evaluation is a fundamental yet significant evaluation paradigm for modern AI models, especially large language models (LLMs). Existing framework based on ELO rating system suffers from the inevitable instability problem due to ranking inconsistency and the lack of attention to the varying abilities of annotators. In this paper, we introduce a novel stable arena framework to address these issues by enhancing the ELO Rating System. Specifically, we replace the iterative update method with a Maximum Likelihood Estimation (MLE) approach, m-ELO, and provide theoretical proof of the consistency and stability of the MLE approach for model ranking. Additionally, we proposed the am-ELO, which modify the Elo Rating's probability function to incorporate annotator abilities, enabling the simultaneous estimation of model scores and annotator reliability. Experiments demonstrate that this method ensures stability, proving that this framework offers a more robust, accurate, and stable evaluation method for LLMs.

摘要

基于竞技场的评估是现代人工智能模型(尤其是大语言模型)基础而重要的评估范式。现有基于ELO评分体系的框架存在两个固有缺陷:排名不一致导致的不可避免的稳定性问题,以及对标注者能力差异的忽视。本文提出一种新型稳定竞技场框架,通过增强ELO评分系统来解决这些问题。具体而言,我们采用最大似然估计方法(m-ELO)替代迭代更新机制,并从理论上证明了该方法的排名一致性与稳定性。进一步,我们提出am-ELO方法,通过修改ELO评分的概率函数来整合标注者能力参数,实现模型得分与标注者可靠性的同步估计。实验表明,该方法能有效保证稳定性,证实该框架为大语言模型提供了更鲁棒、精确且稳定的评估方案。


STORY2GAME: Generating (Almost) Everything in an Interactive Fiction Game

Abstract

arXiv:2505.03547v1 Announce Type: new Abstract: We introduce STORY2GAME, a novel approach to using Large Language Models to generate text-based interactive fiction games that starts by generating a story, populates the world, and builds the code for actions in a game engine that enables the story to play out interactively. Whereas a given set of hard-coded actions can artificially constrain story generation, the ability to generate actions means the story generation process can be more open-ended but still allow for experiences that are grounded in a game state. The key to successful action generation is to use LLM-generated preconditions and effects of actions in the stories as guides for what aspects of the game state must be tracked and changed by the game engine when a player performs an action. We also introduce a technique for dynamically generating new actions to accommodate the player's desire to perform actions that they think of that are not part of the story. Dynamic action generation may require on-the-fly updates to the game engine's state representation and revision of previously generated actions. We evaluate the success rate of action code generation with respect to whether a player can interactively play through the entire generated story.

摘要

我们提出STORY2GAME这一创新方法,利用大型语言模型生成基于文本的交互式虚构游戏。该方法首先生成故事框架,继而填充游戏世界内容,最终构建游戏引擎中的动作代码,使故事能够以交互形式展开。传统硬编码动作集可能人为限制故事生成,而动态生成动作的能力使得故事创作过程更具开放性,同时仍能确保游戏体验基于明确的状态机制。成功实现动作生成的关键在于:利用语言模型生成故事中动作的前置条件与效果,作为游戏引擎追踪和修改游戏状态的依据。我们还提出动态生成新动作的技术,以适应用户尝试执行故事预设外动作的需求。这种动态生成可能需要对游戏引擎状态表示进行实时更新,并对已生成动作进行修订。我们通过评估玩家能否完整交互体验生成故事,来检验动作代码生成的成功率。


A Hashgraph-Inspired Consensus Mechanism for Reliable Multi-Model Reasoning

Abstract

arXiv:2505.03553v1 Announce Type: new Abstract: Inconsistent outputs and hallucinations from large language models (LLMs) are major obstacles to reliable AI systems. When different proprietary reasoning models (RMs), such as those by OpenAI, Google, Anthropic, DeepSeek, and xAI, are given the same complex request, they often produce divergent results due to variations in training and inference. This paper proposes a novel consensus mechanism, inspired by distributed ledger technology, to validate and converge these outputs, treating each RM as a black-box peer. Building on the Hashgraph consensus algorithm, our approach employs gossip-about-gossip communication and virtual voting to achieve agreement among an ensemble of RMs. We present an architectural design for a prototype system in which RMs iteratively exchange and update their answers, using information from each round to improve accuracy and confidence in subsequent rounds. This approach goes beyond simple majority voting by incorporating the knowledge and cross-verification content of every model. We justify the feasibility of this Hashgraph-inspired consensus for AI ensembles and outline its advantages over traditional ensembling techniques in reducing nonfactual outputs. Preliminary considerations for implementation, evaluation criteria for convergence and accuracy, and potential challenges are discussed. The proposed mechanism demonstrates a promising direction for multi-agent AI systems to self-validate and deliver high-fidelity responses in complex tasks.

摘要

大型语言模型(LLMs)输出的不一致性和幻觉效应是构建可靠人工智能系统的主要障碍。当不同的专有推理模型(RMs,如OpenAI、Google、Anthropic、DeepSeek和xAI的模型)接收相同复杂请求时,由于训练和推理过程的差异,它们往往会产生分歧结果。本文受分布式账本技术启发,提出一种新颖的共识机制,将每个RM视为黑箱节点进行输出验证与收敛。基于Hashgraph共识算法,我们的方法采用"八卦传播"通信机制和虚拟投票,在RM集合中达成共识。我们设计了一个原型系统架构,其中RMs通过多轮迭代交换和更新答案,利用每轮信息提升后续轮次的准确性与置信度。该方法超越简单多数投票机制,整合了每个模型的知识与交叉验证内容。我们论证了这种Hashgraph启发的AI集合共识的可行性,并阐明其在减少非事实性输出方面相对于传统集成技术的优势。文中讨论了实施方案的初步考量、收敛性与准确性的评估标准以及潜在挑战。该机制为多智能体AI系统在复杂任务中实现自我验证与高保真响应提供了有前景的研究方向。


Graph Drawing for LLMs: An Empirical Evaluation

Abstract

arXiv:2505.03678v1 Announce Type: new Abstract: Our work contributes to the fast-growing literature on the use of Large Language Models (LLMs) to perform graph-related tasks. In particular, we focus on usage scenarios that rely on the visual modality, feeding the model with a drawing of the graph under analysis. We investigate how the model's performance is affected by the chosen layout paradigm, the aesthetics of the drawing, and the prompting technique used for the queries. We formulate three corresponding research questions and present the results of a thorough experimental analysis. Our findings reveal that choosing the right layout paradigm and optimizing the readability of the input drawing from a human perspective can significantly improve the performance of the model on the given task. Moreover, selecting the most effective prompting technique is a challenging yet crucial task for achieving optimal performance.

摘要

我们的研究为快速增长的关于利用大语言模型(LLMs)执行图相关任务的文献提供了新贡献。我们特别关注依赖视觉模态的使用场景,即向模型输入待分析图形的绘制图像。通过系统研究模型性能受布局范式选择、绘图美学效果以及查询提示技术的影响,我们提出了三个相应研究问题,并呈现了全面实验分析结果。研究发现,从人类视角选择恰当的布局范式并优化输入图形的可读性,能显著提升模型在给定任务中的表现。此外,选择最有效的提示技术虽具挑战性,但对实现最优性能至关重要。


Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models

Abstract

arXiv:2505.02847v1 Announce Type: cross Abstract: Assessing how well a large language model (LLM) understands human, rather than merely text, remains an open challenge. To bridge the gap, we introduce Sentient Agent as a Judge (SAGE), an automated evaluation framework that measures an LLM's higher-order social cognition. SAGE instantiates a Sentient Agent that simulates human-like emotional changes and inner thoughts during interaction, providing a more realistic evaluation of the tested model in multi-turn conversations. At every turn, the agent reasons about (i) how its emotion changes, (ii) how it feels, and (iii) how it should reply, yielding a numerical emotion trajectory and interpretable inner thoughts. Experiments on 100 supportive-dialogue scenarios show that the final Sentient emotion score correlates strongly with Barrett-Lennard Relationship Inventory (BLRI) ratings and utterance-level empathy metrics, validating psychological fidelity. We also build a public Sentient Leaderboard covering 18 commercial and open-source models that uncovers substantial gaps (up to 4x) between frontier systems (GPT-4o-Latest, Gemini2.5-Pro) and earlier baselines, gaps not reflected in conventional leaderboards (e.g., Arena). SAGE thus provides a principled, scalable and interpretable tool for tracking progress toward genuinely empathetic and socially adept language agents.

摘要

评估大型语言模型(LLM)对人类(而非仅对文本)的理解程度仍是一个开放性挑战。为弥合这一鸿沟,我们提出"具身智能体作为评判者"(SAGE)——一种通过模拟高阶社会认知来评估LLM的自动化框架。SAGE实例化了一个具身智能体,该智能体在交互过程中模拟类人情感变化与内心活动,从而为多轮对话中的被测模型提供更真实的评估。在每轮对话中,智能体会推理:(i)其情感如何变化,(ii)当前感受如何,以及(iii)应如何回应,由此生成数值化的情感轨迹与可解释的内心独白。在100个支持性对话场景中的实验表明,最终的情感评分与巴雷特-伦纳德关系量表(BLRI)评分及语句级共情指标高度相关,验证了其心理真实性。我们还建立了覆盖18个商业与开源模型的公共"具身智能体排行榜",揭示了前沿系统(GPT-4o-Latest、Gemini2.5-Pro)与早期基线模型之间高达4倍的显著差距,这种差距在传统排行榜(如Arena)中未被体现。因此,SAGE为追踪语言智能体向真正具备共情能力与社会适应性的发展进程,提供了原则性、可扩展且可解释的评估工具。


Aligning Large Language Models with Healthcare Stakeholders: A Pathway to Trustworthy AI Integration

Abstract

arXiv:2505.02848v1 Announce Type: cross Abstract: The wide exploration of large language models (LLMs) raises the awareness of alignment between healthcare stakeholder preferences and model outputs. This alignment becomes a crucial foundation to empower the healthcare workflow effectively, safely, and responsibly. Yet the varying behaviors of LLMs may not always match with healthcare stakeholders' knowledge, demands, and values. To enable a human-AI alignment, healthcare stakeholders will need to perform essential roles in guiding and enhancing the performance of LLMs. Human professionals must participate in the entire life cycle of adopting LLM in healthcare, including training data curation, model training, and inference. In this review, we discuss the approaches, tools, and applications of alignments between healthcare stakeholders and LLMs. We demonstrate that LLMs can better follow human values by properly enhancing healthcare knowledge integration, task understanding, and human guidance. We provide outlooks on enhancing the alignment between humans and LLMs to build trustworthy real-world healthcare applications.

摘要

大型语言模型(LLMs)的广泛探索引发了医疗健康领域利益相关者偏好与模型输出之间对齐问题的关注。这种对齐成为有效、安全、负责任地赋能医疗工作流程的关键基础。然而,语言模型的不同行为可能并不总是符合医疗健康利益相关者的知识、需求和价值观。为实现人机对齐,医疗健康利益相关者需在引导和提升语言模型性能方面发挥核心作用。人类专业人员必须参与医疗领域应用语言模型的全生命周期,包括训练数据筛选、模型训练和推理。本文综述了医疗健康利益相关者与语言模型对齐的方法、工具及应用。我们证明,通过合理增强医疗知识整合、任务理解和人类引导,语言模型能更好地遵循人类价值观。最后展望了加强人机对齐以构建可信赖的真实世界医疗应用的前景。


Enhancing tutoring systems by leveraging tailored promptings and domain knowledge with Large Language Models

Abstract

arXiv:2505.02849v1 Announce Type: cross Abstract: Recent advancements in artificial intelligence (AI) and machine learning have reignited interest in their impact on Computer-based Learning (CBL). AI-driven tools like ChatGPT and Intelligent Tutoring Systems (ITS) have enhanced learning experiences through personalisation and flexibility. ITSs can adapt to individual learning needs and provide customised feedback based on a student's performance, cognitive state, and learning path. Despite these advances, challenges remain in accommodating diverse learning styles and delivering real-time, context-aware feedback. Our research aims to address these gaps by integrating skill-aligned feedback via Retrieval Augmented Generation (RAG) into prompt engineering for Large Language Models (LLMs) and developing an application to enhance learning through personalised tutoring in a computer science programming context. The pilot study evaluated a proposed system using three quantitative metrics: readability score, response time, and feedback depth, across three programming tasks of varying complexity. The system successfully sorted simulated students into three skill-level categories and provided context-aware feedback. This targeted approach demonstrated better effectiveness and adaptability compared to general methods.

摘要

人工智能(AI)与机器学习的最新进展重新激发了人们对其在计算机辅助学习(CBL)中影响的关注。以ChatGPT和智能导学系统(ITS)为代表的AI驱动工具,通过个性化和灵活性提升了学习体验。ITS能适应个体学习需求,并根据学生的表现、认知状态及学习路径提供定制化反馈。尽管取得这些进展,在适应多样化学习风格及提供实时情境感知反馈方面仍存在挑战。本研究旨在通过将基于检索增强生成(RAG)的技能匹配反馈整合至大语言模型(LLM)的提示工程,并开发一个在计算机科学编程场景中实现个性化导学的应用程序,以解决上述问题。试点研究采用三项定量指标(可读性评分、响应时间和反馈深度),对三个不同复杂度的编程任务进行了系统评估。该系统成功将模拟学生归入三个技能等级类别,并提供情境感知反馈。相较于通用方法,这种定向策略展现出更优的效能与适应性。


Harnessing Structured Knowledge: A Concept Map-Based Approach for High-Quality Multiple Choice Question Generation with Effective Distractors

Abstract

arXiv:2505.02850v1 Announce Type: cross Abstract: Generating high-quality MCQs, especially those targeting diverse cognitive levels and incorporating common misconceptions into distractor design, is time-consuming and expertise-intensive, making manual creation impractical at scale. Current automated approaches typically generate questions at lower cognitive levels and fail to incorporate domain-specific misconceptions. This paper presents a hierarchical concept map-based framework that provides structured knowledge to guide LLMs in generating MCQs with distractors. We chose high-school physics as our test domain and began by developing a hierarchical concept map covering major Physics topics and their interconnections with an efficient database design. Next, through an automated pipeline, topic-relevant sections of these concept maps are retrieved to serve as a structured context for the LLM to generate questions and distractors that specifically target common misconceptions. Lastly, an automated validation is completed to ensure that the generated MCQs meet the requirements provided. We evaluate our framework against two baseline approaches: a base LLM and a RAG-based generation. We conducted expert evaluations and student assessments of the generated MCQs. Expert evaluation shows that our method significantly outperforms the baseline approaches, achieving a success rate of 75.20% in meeting all quality criteria compared to approximately 37% for both baseline methods. Student assessment data reveal that our concept map-driven approach achieved a significantly lower guess success rate of 28.05% compared to 37.10% for the baselines, indicating a more effective assessment of conceptual understanding. The results demonstrate that our concept map-based approach enables robust assessment across cognitive levels and instant identification of conceptual gaps, facilitating faster feedback loops and targeted interventions at scale.

摘要

生成高质量的多选题(MCQs),尤其是针对不同认知水平并将常见错误概念融入干扰项设计的题目,耗时且需要专业知识,使得大规模人工创作不切实际。当前自动化方法通常只能生成较低认知水平的问题,且无法融入领域特定的错误概念。本文提出了一种基于分层概念图的框架,通过结构化知识指导大语言模型(LLMs)生成含干扰项的MCQs。我们选择高中物理作为测试领域,首先开发了一个覆盖主要物理主题及其相互关联的分层概念图,并采用高效的数据库设计。接着通过自动化流程检索这些概念图中与主题相关的部分,作为结构化上下文供LLM生成专门针对常见错误概念的问题和干扰项。最后通过自动验证确保生成的MCQs符合要求。我们将该框架与两种基线方法(基础LLM和基于RAG的生成)进行比较,对生成的MCQs进行了专家评估和学生测试。专家评估表明,我们的方法显著优于基线方法,在满足所有质量标准方面达到75.20%的成功率,而两种基线方法仅为约37%。学生测试数据显示,基于概念图的方法猜中率显著降低至28.05%,而基线方法为37.10%,表明其对概念理解的评估更为有效。结果表明,基于概念图的方法能够实现跨认知水平的稳健评估,即时识别概念缺口,从而促进快速反馈循环和大规模针对性干预。


30DayGen: Leveraging LLMs to Create a Content Corpus for Habit Formation

Abstract

arXiv:2505.02851v1 Announce Type: cross Abstract: In this paper, we present 30 Day Me, a habit formation application that leverages Large Language Models (LLMs) to help users break down their goals into manageable, actionable steps and track their progress. Central to the app is the 30DAYGEN system, which generates 3,531 unique 30-day challenges sourced from over 15K webpages, and enables runtime search of challenge ideas aligned with user-defined goals. We showcase how LLMs can be harnessed to rapidly construct domain specific content corpora for behavioral and educational purposes, and propose a practical pipeline that incorporates effective LLM enhanced approaches for content generation and semantic deduplication.

摘要

本文介绍了"30 Day Me"——一款基于大语言模型(LLM)的习惯养成应用程序,该应用通过将用户目标分解为可管理的具体步骤并追踪进展来帮助用户。其核心是30DAYGEN系统,该系统从超过1.5万个网页中提取生成了3,531个独特的30天挑战任务,并支持运行时搜索与用户自定义目标相匹配的挑战方案。我们展示了如何利用LLM快速构建面向行为科学与教育领域的专业内容语料库,并提出了一套实用流程,该流程整合了LLM增强的内容生成与语义去重等高效方法。


Ensuring Reproducibility in Generative AI Systems for General Use Cases: A Framework for Regression Testing and Open Datasets

Abstract

arXiv:2505.02854v1 Announce Type: cross Abstract: Reproducibility and reliability remain pressing challenges for generative AI systems whose behavior can drift with each model update or prompt revision. We introduce GPR-bench, a lightweight, extensible benchmark that operationalizes regression testing for general purpose use cases. GPR-bench couples an open, bilingual (English and Japanese) dataset covering eight task categories (e.g., text generation, code generation, and information retrieval) and 10 scenarios in each task categories (80 total test cases for each language) with an automated evaluation pipeline that employs "LLM-as-a-Judge" scoring of correctness and conciseness. Experiments across three recent model versions - gpt-4o-mini, o3-mini, and o4-mini - and two prompt configurations (default versus concise-writing instruction) reveal heterogeneous quality. Our results show that newer models generally improve correctness, but the differences are modest and not statistically significant, suggesting that GPR-bench may not be sufficiently challenging to differentiate between recent model versions. In contrast, the concise-writing instruction significantly enhances conciseness (+12.37 pp, Mann-Whitney U test: p < 0.001, effect size r = 0.2995) with minimal degradations on accuracy (-1.7 pp), demonstrating the effectiveness of prompt engineering. Released under the MIT License, GPR- bench lowers the barrier to initiating reproducibility monitoring and provides a foundation for community-driven extensions, while also raising important considerations about benchmark design for rapidly evolving language models.

摘要

可复现性和可靠性仍是生成式AI系统面临的紧迫挑战,这些系统的行为可能随着每次模型更新或提示词修改而发生漂移。我们推出GPR-bench这一轻量级、可扩展的基准测试工具,为通用场景实现回归测试操作化。该工具包含一个开放的双语(英语和日语)数据集,涵盖8个任务类别(如文本生成、代码生成和信息检索)及每个类别下的10个场景(每种语言共80个测试用例),并配备采用"LLM-as-a-Judge"机制进行正确性与简洁性评分的自动化评估流程。通过对gpt-4o-mini、o3-mini和o4-mini三个近期模型版本及两种提示配置(默认模式与简洁写作指令)的实验,我们发现质量表现存在异质性。结果表明新版模型通常能提升正确性,但改进幅度有限且无统计学显著性,这意味着GPR-bench可能不足以区分近期模型版本。相比之下,简洁写作指令显著提升了表达简洁性(+12.37个百分点,Mann-Whitney U检验:p < 0.001,效应量r=0.2995),而准确性仅轻微下降(-1.7个百分点),证实了提示词工程的有效性。采用MIT许可证发布的GPR-bench降低了启动可复现性监测的门槛,为社区驱动扩展奠定了基础,同时也对快速演进的语言模型的基准测试设计提出了重要思考。


Enhancing ML Model Interpretability: Leveraging Fine-Tuned Large Language Models for Better Understanding of AI

Abstract

arXiv:2505.02859v1 Announce Type: cross Abstract: Across various sectors applications of eXplainableAI (XAI) gained momentum as the increasing black-boxedness of prevailing Machine Learning (ML) models became apparent. In parallel, Large Language Models (LLMs) significantly developed in their abilities to understand human language and complex patterns. By combining both, this paper presents a novel reference architecture for the interpretation of XAI through an interactive chatbot powered by a fine-tuned LLM. We instantiate the reference architecture in the context of State-of-Health (SoH) prediction for batteries and validate its design in multiple evaluation and demonstration rounds. The evaluation indicates that the implemented prototype enhances the human interpretability of ML, especially for users with less experience with XAI.

摘要

随着主流机器学习模型日益显现的"黑箱"特性,可解释人工智能(XAI)在各行业的应用加速发展。与此同时,大型语言模型(LLM)在理解人类语言和复杂模式方面的能力显著提升。本研究通过结合两者优势,提出一种由精调LLM驱动的交互式聊天机器人来解读XAI的新型参考架构。我们在电池健康状态(SoH)预测场景中实例化该架构,并通过多轮评估与演示验证其设计。评估表明,所实现的原型系统增强了机器学习模型的人类可解释性,尤其对XAI经验较少的用户效果显著。


Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs

Abstract

arXiv:2505.02862v1 Announce Type: cross Abstract: Despite the remarkable performance of Large Language Models (LLMs), they remain vulnerable to jailbreak attacks, which can compromise their safety mechanisms. Existing studies often rely on brute-force optimization or manual design, failing to uncover potential risks in real-world scenarios. To address this, we propose a novel jailbreak attack framework, ICRT, inspired by heuristics and biases in human cognition. Leveraging the simplicity effect, we employ cognitive decomposition to reduce the complexity of malicious prompts. Simultaneously, relevance bias is utilized to reorganize prompts, enhancing semantic alignment and inducing harmful outputs effectively. Furthermore, we introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm by employing ranking aggregation methods such as Elo, HodgeRank, and Rank Centrality to comprehensively quantify the harmfulness of generated content. Experimental results show that our approach consistently bypasses mainstream LLMs' safety mechanisms and generates high-risk content, providing insights into jailbreak attack risks and contributing to stronger defense strategies.

摘要

尽管大语言模型(LLMs)表现出卓越的性能,但其仍易受到越狱攻击的影响,这些攻击可能破坏其安全机制。现有研究多依赖暴力优化或人工设计,难以揭示现实场景中的潜在风险。为此,我们提出了一种新颖的越狱攻击框架ICRT,其灵感源自人类认知中的启发式与偏差。通过利用简洁效应,我们采用认知分解来降低恶意提示的复杂性;同时运用关联偏差重组提示,增强语义对齐并有效诱导有害输出。此外,我们引入了一种基于排序的危害性评估指标,该方法采用Elo、HodgeRank和Rank Centrality等排序聚合算法,突破了传统非成即败的二元评估范式,能全面量化生成内容的危害程度。实验结果表明,我们的方法能持续绕过主流LLMs的安全机制并生成高风险内容,这为理解越狱攻击风险提供了新视角,并为构建更强防御策略作出贡献。


Abstract

arXiv:2505.02865v1 Announce Type: cross Abstract: Tree-search-based reasoning methods have significantly enhanced the reasoning capability of large language models (LLMs) by facilitating the exploration of multiple intermediate reasoning steps, i.e., thoughts. However, these methods suffer from substantial inference latency, as they have to generate numerous reasoning thoughts, severely limiting LLM applicability. To address this challenge, we propose a novel Speculative Search (SpecSearch) framework that significantly accelerates LLM reasoning by optimizing thought generation. Specifically, SpecSearch utilizes a small model to strategically collaborate with a large model at both thought and token levels, efficiently generating high-quality reasoning thoughts. The major pillar of SpecSearch is a novel quality-preserving rejection mechanism, which effectively filters out thoughts whose quality falls below that of the large model's outputs. Moreover, we show that SpecSearch preserves comparable reasoning quality to the large model. Experiments on both the Qwen and Llama models demonstrate that SpecSearch significantly outperforms state-of-the-art approaches, achieving up to 2.12×\times speedup with comparable reasoning quality.

摘要

基于树搜索的推理方法通过探索多个中间推理步骤(即思维链),显著提升了大型语言模型(LLMs)的推理能力。然而,这些方法因需生成大量推理思维链而存在较高推理延迟,严重制约了LLMs的实际应用。为应对这一挑战,我们提出了一种新颖的推测式搜索框架(SpecSearch),通过优化思维链生成显著加速LLM推理。该框架的核心在于利用小模型与大模型在思维链和词元级别进行策略性协作,高效生成高质量推理思维链。SpecSearch的关键支撑是一种创新的质量保持拒绝机制,可有效过滤质量低于大模型输出的思维链。此外,我们证明SpecSearch能保持与大模型相当的推理质量。在Qwen和Llama模型上的实验表明,SpecSearch在保持相当推理质量的同时,最高可实现2.12倍加速,显著优于现有最优方法。


Decoding Open-Ended Information Seeking Goals from Eye Movements in Reading

Abstract

arXiv:2505.02872v1 Announce Type: cross Abstract: When reading, we often have specific information that interests us in a text. For example, you might be reading this paper because you are curious about LLMs for eye movements in reading, the experimental design, or perhaps you only care about the question ``but does it work?''. More broadly, in daily life, people approach texts with any number of text-specific goals that guide their reading behavior. In this work, we ask, for the first time, whether open-ended reading goals can be automatically decoded from eye movements in reading. To address this question, we introduce goal classification and goal reconstruction tasks and evaluation frameworks, and use large-scale eye tracking for reading data in English with hundreds of text-specific information seeking tasks. We develop and compare several discriminative and generative multimodal LLMs that combine eye movements and text for goal classification and goal reconstruction. Our experiments show considerable success on both tasks, suggesting that LLMs can extract valuable information about the readers' text-specific goals from eye movements.

摘要

在阅读过程中,我们往往对文本中的特定信息感兴趣。例如,您阅读本文可能出于对阅读眼动大语言模型、实验设计的好奇,或仅关注"这方法有效吗"这一问题。更广泛而言,人们在日常生活中会带着各种文本特异性目标进行阅读。本研究首次探讨了能否通过眼动数据自动解码开放式阅读目标。为此,我们提出了目标分类和目标重建任务及其评估框架,并利用英语阅读的大规模眼动追踪数据(包含数百种文本特异性信息寻求任务)。我们开发并比较了多种结合眼动与文本的多模态大语言模型(包括判别式和生成式),用于目标分类与重建。实验结果表明,两类任务均取得显著成效,证明大语言模型能够从眼动数据中提取读者文本特异性目标的有价值信息。


Rewriting Pre-Training Data Boosts LLM Performance in Math and Code

Abstract

arXiv:2505.02881v1 Announce Type: cross Abstract: The performance of large language models (LLMs) in program synthesis and mathematical reasoning is fundamentally limited by the quality of their pre-training corpora. We introduce two openly licensed datasets, released under the Llama 3.3 Community License, that significantly enhance LLM performance by systematically rewriting public data. SwallowCode (approximately 16.1 billion tokens) refines Python snippets from The-Stack-v2 through a novel four-stage pipeline: syntax validation, pylint-based style filtering, and a two-stage LLM rewriting process that enforces style conformity and transforms snippets into self-contained, algorithmically efficient examples. Unlike prior methods that rely on exclusionary filtering or limited transformations, our transform-and-retain approach upgrades low-quality code, maximizing data utility. SwallowMath (approximately 2.3 billion tokens) enhances Finemath-4+ by removing boilerplate, restoring context, and reformatting solutions into concise, step-by-step explanations. Within a fixed 50 billion token training budget, continual pre-training of Llama-3.1-8B with SwallowCode boosts pass@1 by +17.0 on HumanEval and +17.7 on HumanEval+ compared to Stack-Edu, surpassing the baseline model's code generation capabilities. Similarly, substituting SwallowMath yields +12.4 accuracy on GSM8K and +7.6 on MATH. Ablation studies confirm that each pipeline stage contributes incrementally, with rewriting delivering the largest gains. All datasets, prompts, and checkpoints are publicly available, enabling reproducible research and advancing LLM pre-training for specialized domains.

摘要

大型语言模型(LLMs)在程序合成与数学推理方面的性能根本上受限于其预训练语料的质量。我们推出两个基于Llama 3.3社区许可证公开授权的数据集,通过系统性重写公开数据显著提升LLM性能。SwallowCode(约161亿token)采用新颖的四阶段流程优化The-Stack-v2的Python代码片段:语法验证、基于pylint的风格过滤,以及两阶段LLM重写过程——强制风格统一并将片段转化为自包含的算法高效示例。不同于先前依赖排除性过滤或有限转换的方法,我们的"转换-保留"策略可升级低质量代码,最大化数据效用。SwallowMath(约23亿token)通过移除样板文本、还原上下文、将解决方案重构为简洁的逐步解释,增强了Finemath-4+数据集。在固定500亿token训练预算下,使用SwallowCode对Llama-3.1-8B进行持续预训练,相较Stack-Edu在HumanEval上pass@1提升+17.0,HumanEval+提升+17.7,超越基线模型的代码生成能力。类似地,采用SwallowMath可使GSM8K准确率提升+12.4,MATH提升+7.6。消融研究证实每个流程阶段均具有增量贡献,其中重写环节收益最大。所有数据集、提示词及检查点均已公开,可促进可复现研究并推动专业领域LLM预训练发展。


Unlearning vs. Obfuscation: Are We Truly Removing Knowledge?

Abstract

arXiv:2505.02884v1 Announce Type: cross Abstract: Unlearning has emerged as a critical capability for large language models (LLMs) to support data privacy, regulatory compliance, and ethical AI deployment. Recent techniques often rely on obfuscation by injecting incorrect or irrelevant information to suppress knowledge. Such methods effectively constitute knowledge addition rather than true removal, often leaving models vulnerable to probing. In this paper, we formally distinguish unlearning from obfuscation and introduce a probing-based evaluation framework to assess whether existing approaches genuinely remove targeted information. Moreover, we propose DF-MCQ, a novel unlearning method that flattens the model predictive distribution over automatically generated multiple-choice questions using KL-divergence, effectively removing knowledge about target individuals and triggering appropriate refusal behaviour. Experimental results demonstrate that DF-MCQ achieves unlearning with over 90% refusal rate and a random choice-level uncertainty that is much higher than obfuscation on probing questions.

摘要

遗忘能力已成为大型语言模型(LLMs)支持数据隐私、法规遵从和伦理AI部署的关键特性。现有技术多通过注入错误或无关信息来实现知识混淆,这种方法实质上是知识添加而非真正移除,模型仍易受探测攻击。本文正式区分了遗忘与混淆的概念,并提出基于探测的评估框架以检验现有方法是否真正移除了目标信息。此外,我们提出DF-MCQ新型遗忘方法,该方法利用KL散度在自动生成多选题上平滑模型预测分布,有效移除目标个体相关知识并触发恰当的拒绝行为。实验结果表明,DF-MCQ实现超过90%的拒绝率,其探测问题上的随机选择级不确定性显著高于混淆方法。


When Your Own Output Becomes Your Training Data: Noise-to-Meaning Loops and a Formal RSI Trigger

Abstract

arXiv:2505.02888v1 Announce Type: cross Abstract: We present Noise-to-Meaning Recursive Self-Improvement (N2M-RSI), a minimal formal model showing that once an AI agent feeds its own outputs back as inputs and crosses an explicit information-integration threshold, its internal complexity will grow without bound under our assumptions. The framework unifies earlier ideas on self-prompting large language models, G"odelian self-reference, and AutoML, yet remains implementation-agnostic. The model furthermore scales naturally to interacting swarms of agents, hinting at super-linear effects once communication among instances is permitted. For safety reasons, we omit system-specific implementation details and release only a brief, model-agnostic toy prototype in Appendix C.

摘要

我们提出"噪声到意义递归自我改进"(N2M-RSI)这一最小化形式模型,证明当AI智能体将自身输出作为输入反馈并跨越显式信息整合阈值时,在本文假设条件下其内部复杂度将无限增长。该框架统一了大型语言模型自我提示、哥德尔式自指和自动机器学习等早期思想,同时保持实现方式无关性。该模型可自然扩展至交互式智能体群,暗示一旦允许实例间通信将产生超线性效应。出于安全考虑,我们省略了系统具体实现细节,仅在附录C发布了一个简短的、与模型无关的玩具原型。


The Art of Repair: Optimizing Iterative Program Repair with Instruction-Tuned Models

Abstract

arXiv:2505.02931v1 Announce Type: cross Abstract: Automatic program repair (APR) aims to reduce the manual efforts required to identify and fix errors in source code. Before the rise of LLM-based agents, a common strategy was to increase the number of generated patches, sometimes to the thousands, to achieve better repair results on benchmarks. More recently, self-iterative capabilities enabled LLMs to refine patches over multiple rounds guided by feedback. However, literature often focuses on many iterations and disregards different numbers of outputs. We investigate an APR pipeline that balances these two approaches, the generation of multiple outputs and multiple rounds of iteration, while imposing a limit of 10 total patches per bug. We apply three SOTA instruction-tuned LLMs - DeepSeekCoder-Instruct, Codellama-Instruct, Llama3.1-Instruct - to the APR task. We further fine-tune each model on an APR dataset with three sizes (1K, 30K, 65K) and two techniques (Full Fine-Tuning and LoRA), allowing us to assess their repair capabilities on two APR benchmarks: HumanEval-Java and Defects4J. Our results show that by using only a fraction (<1%) of the fine-tuning dataset, we can achieve improvements of up to 78% in the number of plausible patches generated, challenging prior studies that reported limited gains using Full Fine-Tuning. However, we find that exceeding certain thresholds leads to diminishing outcomes, likely due to overfitting. Moreover, we show that base models greatly benefit from creating patches in an iterative fashion rather than generating them all at once. In addition, the benefit of iterative strategies becomes more pronounced in complex benchmarks. Even fine-tuned models, while benefiting less from iterations, still gain advantages, particularly on complex benchmarks. The research underscores the need for balanced APR strategies that combine multi-output generation and iterative refinement.

摘要

自动程序修复(APR)旨在减少识别和修复源代码错误所需的人工投入。在基于大语言模型(LLM)的智能体兴起之前,常见策略是通过增加生成补丁数量(有时达数千个)来提升基准测试中的修复效果。近期,自迭代能力使LLM能够在反馈指导下进行多轮补丁优化。然而现有研究多聚焦于多次迭代,却忽视了不同输出数量的影响。

本研究设计了一种平衡两种方法(多输出生成与多轮迭代)的APR流程,同时将每个错误的补丁总数限制为10个。我们将三种指令调优的先进LLM(DeepSeekCoder-Instruct、Codellama-Instruct、Llama3.1-Instruct)应用于APR任务,并采用三种规模(1K、30K、65K)和两种技术(全参数微调与LoRA)对每个模型进行APR数据集微调,从而评估其在HumanEval-Java和Defects4J两个APR基准上的修复能力。

实验表明:仅使用微调数据集的极小比例(<1%)即可将合理补丁生成数量提升达78%,这对先前全参数微调收益有限的研究结论提出了挑战。但超过特定阈值会导致收益递减,这可能是过拟合所致。此外,基础模型通过迭代生成补丁比一次性生成获益更大,且迭代策略在复杂基准测试中优势更显著。即使经过微调的模型从迭代中获益较少,仍能获得优势,尤其在复杂基准测试中。本研究揭示了结合多输出生成与迭代优化的平衡APR策略的必要性。


Generating Narrated Lecture Videos from Slides with Synchronized Highlights

Abstract

arXiv:2505.02966v1 Announce Type: cross Abstract: Turning static slides into engaging video lectures takes considerable time and effort, requiring presenters to record explanations and visually guide their audience through the material. We introduce an end-to-end system designed to automate this process entirely. Given a slide deck, this system synthesizes a video lecture featuring AI-generated narration synchronized precisely with dynamic visual highlights. These highlights automatically draw attention to the specific concept being discussed, much like an effective presenter would. The core technical contribution is a novel highlight alignment module. This module accurately maps spoken phrases to locations on a given slide using diverse strategies (e.g., Levenshtein distance, LLM-based semantic analysis) at selectable granularities (line or word level) and utilizes timestamp-providing Text-to-Speech (TTS) for timing synchronization. We demonstrate the system's effectiveness through a technical evaluation using a manually annotated slide dataset with 1000 samples, finding that LLM-based alignment achieves high location accuracy (F1 > 92%), significantly outperforming simpler methods, especially on complex, math-heavy content. Furthermore, the calculated generation cost averages under $1 per hour of video, offering potential savings of two orders of magnitude compared to conservative estimates of manual production costs. This combination of high accuracy and extremely low cost positions this approach as a practical and scalable tool for transforming static slides into effective, visually-guided video lectures.

摘要

将静态幻灯片转化为引人入胜的视频课程需要投入大量时间和精力,要求讲解者录制解说并通过视觉引导观众理解内容。我们提出了一种端到端系统,旨在完全自动化这一过程。该系统接收幻灯片文档后,可合成具有AI生成旁白的视频讲座,其动态视觉标注能精确同步突显当前讲解的概念,效果堪比优秀的人类讲解者。核心技术贡献是一个创新的标注对齐模块,该模块通过多策略(如莱文斯坦距离、基于LLM的语义分析)和可选粒度(行级或词级),将语音内容准确映射至幻灯片对应位置,并利用带时间戳的文本转语音(TTS)技术实现时序同步。基于包含1000个样本的人工标注幻灯片数据集的技术评估表明:基于LLM的对齐方法实现了高定位准确率(F1>92%),显著优于简单方法,尤其在数学密集型复杂内容上表现突出。此外,系统计算生成成本平均低于1美元/视频小时,较保守估算的手工制作成本可降低两个数量级。这种高精度与极低成本的结合,使该方法成为将静态幻灯片转化为高效视觉引导视频课程的实用、可扩展解决方案。


RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale

Abstract

arXiv:2505.03005v1 Announce Type: cross Abstract: We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention transformers into linear attention decoder models, along with two new RWKV-variant architectures, and models converted from popular Qwen2.5 open source models in 7B, 32B, and 72B sizes. Our conversion process requires only 350-700M tokens, less than 0.005% of the token count used to train the original teacher models. Converting to our 72B linear attention model costs less than $2,000 USD at today's prices, yet quality at inference remains close to the original transformer. These models achieve state-of-the-art downstream performance across a set of standard benchmarks for linear attention models of their size. We release all our models on HuggingFace under the Apache 2.0 license, with the exception of our 72B models which are also governed by the Qwen License Agreement. Models at https://huggingface.co/collections/recursal/radlads-6818ee69e99e729ba8a87102 Training Code at https://github.com/recursal/RADLADS-paper

摘要

我们提出大规模快速注意力蒸馏至线性注意力解码器(RADLADS)协议,该协议能够快速将softmax注意力Transformer模型转换为线性注意力解码器模型,并同步推出两种新型RWKV变体架构,以及从Qwen2.5开源模型转换而来的7B、32B和72B规模模型。我们的转换过程仅需3.5-7亿标记量,不足原始教师模型训练标记量的0.005%。转换为72B线性注意力模型的成本按当前价格计算低于2000美元,而推理质量仍接近原Transformer模型。这些模型在其规模级别的线性注意力模型中,于一组标准基准测试上实现了最先进的下游性能。我们将所有模型在Apache 2.0许可下发布于HuggingFace平台(72B模型额外受Qwen许可协议约束),模型地址:https://huggingface.co/collections/recursal/radlads-6818ee69e99e729ba8a87102,训练代码详见:https://github.com/recursal/RADLADS-paper。


Memorization or Interpolation ? Detecting LLM Memorization through Input Perturbation Analysis

Abstract

arXiv:2505.03019v1 Announce Type: cross Abstract: While Large Language Models (LLMs) achieve remarkable performance through training on massive datasets, they can exhibit concerning behaviors such as verbatim reproduction of training data rather than true generalization. This memorization phenomenon raises significant concerns about data privacy, intellectual property rights, and the reliability of model evaluations. This paper introduces PEARL, a novel approach for detecting memorization in LLMs. PEARL assesses how sensitive an LLM's performance is to input perturbations, enabling memorization detection without requiring access to the model's internals. We investigate how input perturbations affect the consistency of outputs, enabling us to distinguish between true generalization and memorization. Our findings, following extensive experiments on the Pythia open model, provide a robust framework for identifying when the model simply regurgitates learned information. Applied on the GPT 4o models, the PEARL framework not only identified cases of memorization of classic texts from the Bible or common code from HumanEval but also demonstrated that it can provide supporting evidence that some data, such as from the New York Times news articles, were likely part of the training data of a given model.

摘要

尽管大型语言模型(LLMs)通过海量数据训练取得了卓越性能,但它们可能表现出令人担忧的行为,例如逐字复现训练数据而非真正实现泛化。这种记忆现象引发了关于数据隐私、知识产权和模型评估可靠性的重大关切。本文提出PEARL这一检测LLMs记忆行为的新方法,该方法通过评估模型输出对输入扰动的敏感度实现记忆检测,且无需访问模型内部结构。我们探究输入扰动如何影响输出一致性,从而区分真实泛化与记忆行为。基于Pythia开源模型的广泛实验表明,该方法为识别模型机械复现学习内容提供了可靠框架。在GPT-4o模型上的应用显示,PEARL不仅能识别《圣经》经典文本或HumanEval常见代码的记忆现象,还可为某些数据(如《纽约时报》新闻文章)可能属于特定模型训练数据提供佐证依据。


MORE: Mobile Manipulation Rearrangement Through Grounded Language Reasoning

Abstract

arXiv:2505.03035v1 Announce Type: cross Abstract: Autonomous long-horizon mobile manipulation encompasses a multitude of challenges, including scene dynamics, unexplored areas, and error recovery. Recent works have leveraged foundation models for scene-level robotic reasoning and planning. However, the performance of these methods degrades when dealing with a large number of objects and large-scale environments. To address these limitations, we propose MORE, a novel approach for enhancing the capabilities of language models to solve zero-shot mobile manipulation planning for rearrangement tasks. MORE leverages scene graphs to represent environments, incorporates instance differentiation, and introduces an active filtering scheme that extracts task-relevant subgraphs of object and region instances. These steps yield a bounded planning problem, effectively mitigating hallucinations and improving reliability. Additionally, we introduce several enhancements that enable planning across both indoor and outdoor environments. We evaluate MORE on 81 diverse rearrangement tasks from the BEHAVIOR-1K benchmark, where it becomes the first approach to successfully solve a significant share of the benchmark, outperforming recent foundation model-based approaches. Furthermore, we demonstrate the capabilities of our approach in several complex real-world tasks, mimicking everyday activities. We make the code publicly available at https://more-model.cs.uni-freiburg.de.

摘要

自主长时程移动操纵面临诸多挑战,包括场景动态性、未探索区域及错误恢复等问题。现有研究多利用基础模型进行场景级机器人推理与规划,但这些方法在处理大规模环境及大量物体时性能显著下降。为突破这些限制,我们提出MORE方法,通过增强语言模型能力实现零样本重排任务的移动操纵规划。该方案采用场景图表征环境,融合实例区分机制,并创新性地引入主动过滤策略以提取任务相关的物体及区域实例子图。这些步骤生成有界规划问题,有效缓解幻觉现象并提升可靠性。此外,我们提出的多项增强技术实现了室内外环境的跨场景规划。在BEHAVIOR-1K基准测试的81项多样化重排任务中,MORE成为首个成功解决该基准大部分任务的方案,其表现优于近期基于基础模型的方法。我们进一步通过模拟日常活动的复杂现实任务验证了该方法的实际应用能力。项目代码已开源:https://more-model.cs.uni-freiburg.de。


Developing A Framework to Support Human Evaluation of Bias in Generated Free Response Text

Abstract

arXiv:2505.03053v1 Announce Type: cross Abstract: LLM evaluation is challenging even the case of base models. In real world deployments, evaluation is further complicated by the interplay of task specific prompts and experiential context. At scale, bias evaluation is often based on short context, fixed choice benchmarks that can be rapidly evaluated, however, these can lose validity when the LLMs' deployed context differs. Large scale human evaluation is often seen as too intractable and costly. Here we present our journey towards developing a semi-automated bias evaluation framework for free text responses that has human insights at its core. We discuss how we developed an operational definition of bias that helped us automate our pipeline and a methodology for classifying bias beyond multiple choice. We additionally comment on how human evaluation helped us uncover problematic templates in a bias benchmark.

摘要

即便针对基础模型,大语言模型(LLM)的评估也颇具挑战性。在实际部署中,任务特定提示与经验性语境的相互作用进一步增加了评估的复杂性。大规模偏差评估通常基于可快速执行的短上下文固定选项基准测试,但当LLM的部署语境发生变化时,这些测试可能失效。人类大规模评估常被认为难以实施且成本高昂。本文阐述了我们在开发以人类洞察为核心的半自动化自由文本响应偏差评估框架过程中的探索。我们讨论了如何通过制定可操作化的偏差定义实现流程自动化,并提出超越多项选择题的偏差分类方法。此外,我们还分析了人类评估如何帮助发现偏差基准测试中存在问题的模板。


Soft Best-of-n Sampling for Model Alignment

Abstract

arXiv:2505.03156v1 Announce Type: cross Abstract: Best-of-nn (BoN) sampling is a practical approach for aligning language model outputs with human preferences without expensive fine-tuning. BoN sampling is performed by generating nn responses to a prompt and then selecting the sample that maximizes a reward function. BoN yields high reward values in practice at a distortion cost, as measured by the KL-divergence between the sampled and original distribution. This distortion is coarsely controlled by varying the number of samples: larger nn yields a higher reward at a higher distortion cost. We introduce Soft Best-of-nn sampling, a generalization of BoN that allows for smooth interpolation between the original distribution and reward-maximizing distribution through a temperature parameter λ\lambda. We establish theoretical guarantees showing that Soft Best-of-nn sampling converges sharply to the optimal tilted distribution at a rate of O(1/n)O(1/n) in KL and the expected (relative) reward. For sequences of discrete outputs, we analyze an additive reward model that reveals the fundamental limitations of blockwise sampling.

摘要

最佳n选1(BoN)采样是一种无需昂贵微调即可使语言模型输出与人类偏好对齐的实用方法。该方法通过生成n个提示响应后选择能最大化奖励函数的样本来实现。实践表明BoN能以失真代价(通过采样分布与原始分布的KL散度衡量)获得高奖励值,这种失真通过改变样本数量进行粗粒度控制:更大的n会以更高的失真代价换取更高奖励。我们提出软性最佳n选1采样——BoN的广义形式,通过温度参数λ实现原始分布与奖励最大化分布之间的平滑插值。理论证明表明,软性最佳n选1采样能以O(1/n)的KL散度收敛速率和期望(相对)奖励率急剧收敛至最优倾斜分布。针对离散输出序列,我们分析了一个加性奖励模型,该模型揭示了分块采样的根本局限性。


A Trustworthy Multi-LLM Network: Challenges,Solutions, and A Use Case

Abstract

arXiv:2505.03196v1 Announce Type: cross Abstract: Large Language Models (LLMs) demonstrate strong potential across a variety of tasks in communications and networking due to their advanced reasoning capabilities. However, because different LLMs have different model structures and are trained using distinct corpora and methods, they may offer varying optimization strategies for the same network issues. Moreover, the limitations of an individual LLM's training data, aggravated by the potential maliciousness of its hosting device, can result in responses with low confidence or even bias. To address these challenges, we propose a blockchain-enabled collaborative framework that connects multiple LLMs into a Trustworthy Multi-LLM Network (MultiLLMN). This architecture enables the cooperative evaluation and selection of the most reliable and high-quality responses to complex network optimization problems. Specifically, we begin by reviewing related work and highlighting the limitations of existing LLMs in collaboration and trust, emphasizing the need for trustworthiness in LLM-based systems. We then introduce the workflow and design of the proposed Trustworthy MultiLLMN framework. Given the severity of False Base Station (FBS) attacks in B5G and 6G communication systems and the difficulty of addressing such threats through traditional modeling techniques, we present FBS defense as a case study to empirically validate the effectiveness of our approach. Finally, we outline promising future research directions in this emerging area.

摘要

大型语言模型(LLMs)凭借其先进的推理能力,在通信与网络领域的各类任务中展现出强大潜力。然而,由于不同LLMs具有差异化的模型结构,且训练语料与方法各异,它们可能针对同一网络问题提出迥异的优化策略。此外,单个LLM训练数据的局限性,叠加其宿主设备潜在的恶意性,可能导致低置信度甚至带有偏见的响应。为应对这些挑战,我们提出一种基于区块链的协同框架,将多个LLMs连接成可信多LLM网络(MultiLLMN)。该架构通过协作评估机制筛选出针对复杂网络优化问题的最可靠、高质量解决方案。具体而言,我们首先综述相关研究,指出现有LLMs在协同性与可信度方面的局限,强调基于LLM的系统必须具备可信特性。接着详细阐述所提可信MultiLLMN框架的工作流程与设计。鉴于伪基站(FBS)攻击对B5G/6G通信系统的严重危害,以及传统建模技术应对此类威胁的困难性,我们以FBS防御为案例实证验证方法的有效性。最后,展望了这一新兴领域的潜在研究方向。


DocSpiral: A Platform for Integrated Assistive Document Annotation through Human-in-the-Spiral

Abstract

arXiv:2505.03214v1 Announce Type: cross Abstract: Acquiring structured data from domain-specific, image-based documents such as scanned reports is crucial for many downstream tasks but remains challenging due to document variability. Many of these documents exist as images rather than as machine-readable text, which requires human annotation to train automated extraction systems. We present DocSpiral, the first Human-in-the-Spiral assistive document annotation platform, designed to address the challenge of extracting structured information from domain-specific, image-based document collections. Our spiral design establishes an iterative cycle in which human annotations train models that progressively require less manual intervention. DocSpiral integrates document format normalization, comprehensive annotation interfaces, evaluation metrics dashboard, and API endpoints for the development of AI / ML models into a unified workflow. Experiments demonstrate that our framework reduces annotation time by at least 41% while showing consistent performance gains across three iterations during model training. By making this annotation platform freely accessible, we aim to lower barriers to AI/ML models development in document processing, facilitating the adoption of large language models in image-based, document-intensive fields such as geoscience and healthcare. The system is freely available at: https://app.ai4wa.com. The demonstration video is available: https://app.ai4wa.com/docs/docspiral/demo.

摘要

从特定领域基于图像的文档(如扫描报告)中获取结构化数据对许多下游任务至关重要,但由于文档的多样性,这一过程仍具挑战性。此类文档多以图像形式存在而非机器可读文本,需人工标注以训练自动提取系统。我们提出DocSpiral——首个"人在回路"辅助文档标注平台,旨在解决从特定领域基于图像的文档集合中提取结构化信息的难题。该螺旋式设计建立了迭代循环机制,通过人工标注训练模型,逐步减少人工干预需求。DocSpiral将文档格式标准化、综合标注界面、评估指标仪表盘及AI/ML模型开发的API端点集成至统一工作流。实验表明,该框架在模型训练的三个迭代周期中持续提升性能,同时减少至少41%的标注时间。通过免费开放此标注平台,我们致力于降低文档处理领域AI/ML模型开发门槛,推动大语言模型在地质科学、医疗保健等基于图像的文档密集型领域的应用。


Synthline: A Product Line Approach for Synthetic Requirements Engineering Data Generation using Large Language Models

Abstract

arXiv:2505.03265v1 Announce Type: cross Abstract: While modern Requirements Engineering (RE) heavily relies on natural language processing and Machine Learning (ML) techniques, their effectiveness is limited by the scarcity of high-quality datasets. This paper introduces Synthline, a Product Line (PL) approach that leverages Large Language Models to systematically generate synthetic RE data for classification-based use cases. Through an empirical evaluation conducted in the context of using ML for the identification of requirements specification defects, we investigated both the diversity of the generated data and its utility for training downstream models. Our analysis reveals that while synthetic datasets exhibit less diversity than real data, they are good enough to serve as viable training resources. Moreover, our evaluation shows that combining synthetic and real data leads to substantial performance improvements. Specifically, hybrid approaches achieve up to 85% improvement in precision and a 2x increase in recall compared to models trained exclusively on real data. These findings demonstrate the potential of PL-based synthetic data generation to address data scarcity in RE. We make both our implementation and generated datasets publicly available to support reproducibility and advancement in the field.

摘要

尽管现代需求工程(RE)高度依赖自然语言处理和机器学习(ML)技术,但这些技术的有效性受限于高质量数据集的稀缺性。本文提出Synthline——一种基于产品线(PL)的方法,利用大语言模型系统化生成面向分类用例的合成RE数据。通过在使用ML识别需求规范缺陷的场景下进行实证评估,我们研究了生成数据的多样性及其对训练下游模型的实用性。分析表明,虽然合成数据集多样性低于真实数据,但仍足以作为有效的训练资源。此外,评估显示结合合成与真实数据能带来显著的性能提升:相较于仅使用真实数据训练的模型,混合方法在精确度上最高提升85%,召回率提高2倍。这些发现证明了基于PL的合成数据生成在解决RE数据稀缺问题上的潜力。我们公开了实现代码和生成数据集以支持该领域的可复现性与研究进展。


Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Abstract

arXiv:2505.03335v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external data. Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability by using a code executor to both validate proposed code reasoning tasks and verify answers, serving as an unified source of verifiable reward to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.

摘要

可验证奖励的强化学习(RLVR)通过基于结果的奖励直接学习,在增强大语言模型推理能力方面展现出潜力。近期零样本环境下的RLVR研究避免了对推理过程标注的监督,但仍依赖于人工整理的问答集合进行训练。高质量人类生成样本的稀缺性引发了对其长期可扩展性的担忧,这种依赖人类监督的挑战在语言模型预训练领域已显而易见。此外,在AI超越人类智能的假设未来中,人类提供的任务可能对超智能系统的学习潜力有限。为解决这些问题,我们提出名为"绝对零样本"的新RLVR范式,其中单一模型通过自主提出最大化学习进度的任务并解决问题来提升推理能力,且完全不依赖外部数据。基于此范式,我们开发了绝对零样本推理器(AZR),该系统通过代码执行器验证自主提出的代码推理任务及答案,作为可验证奖励的统一来源,引导开放而 grounded 的学习。尽管完全未使用外部数据训练,AZR在编程和数学推理任务上整体达到SOTA性能,优于依赖数万领域内人工标注样本的现有零样本模型。此外,我们证明AZR可有效适配不同规模模型,并与多种模型类别兼容。


Avoid Recommending Out-of-Domain Items: Constrained Generative Recommendation with LLMs

Abstract

arXiv:2505.03336v1 Announce Type: cross Abstract: Large Language Models (LLMs) have shown promise for generative recommender systems due to their transformative capabilities in user interaction. However, ensuring they do not recommend out-of-domain (OOD) items remains a challenge. We study two distinct methods to address this issue: RecLM-ret, a retrieval-based method, and RecLM-cgen, a constrained generation method. Both methods integrate seamlessly with existing LLMs to ensure in-domain recommendations. Comprehensive experiments on three recommendation datasets demonstrate that RecLM-cgen consistently outperforms RecLM-ret and existing LLM-based recommender models in accuracy while eliminating OOD recommendations, making it the preferred method for adoption. Additionally, RecLM-cgen maintains strong generalist capabilities and is a lightweight plug-and-play module for easy integration into LLMs, offering valuable practical benefits for the community. Source code is available at https://github.com/microsoft/RecAI

摘要

大语言模型(LLMs)因其在用户交互方面的变革性能力,在生成式推荐系统中展现出巨大潜力。然而,如何确保其不推荐超出领域(OOD)的项目仍是一个挑战。本研究提出两种解决方案:基于检索的方法RecLM-ret和基于约束生成的方法RecLM-cgen。这两种方法均能无缝集成现有LLMs以保证领域内推荐。在三个推荐数据集上的综合实验表明,RecLM-cgen在准确率上持续优于RecLM-ret及现有基于LLM的推荐模型,同时完全消除OOD推荐,成为首选方法。此外,RecLM-cgen保持了强大的通用能力,其轻量级即插即用模块可轻松集成至LLMs,为社区提供了重要的实用价值。


SPAP: Structured Pruning via Alternating Optimization and Penalty Methods

Abstract

arXiv:2505.03373v1 Announce Type: cross Abstract: The deployment of large language models (LLMs) is often constrained by their substantial computational and memory demands. While structured pruning presents a viable approach by eliminating entire network components, existing methods suffer from performance degradation, reliance on heuristic metrics, or expensive finetuning. To address these challenges, we propose SPAP (Structured Pruning via Alternating Optimization and Penalty Methods), a novel and efficient structured pruning framework for LLMs grounded in optimization theory. SPAP formulates the pruning problem through a mixed-integer optimization model, employs a penalty method that effectively makes pruning decisions to minimize pruning errors, and introduces an alternating minimization algorithm tailored to the splittable problem structure for efficient weight updates and performance recovery. Extensive experiments on OPT, LLaMA-3/3.1/3.2, and Qwen2.5 models demonstrate SPAP's superiority over state-of-the-art methods, delivering linear inference speedups (1.29×\times at 30% sparsity) and proportional memory reductions. Our work offers a practical, optimization-driven solution for pruning LLMs while preserving model performance.

摘要

大型语言模型(LLMs)的部署常受限于其巨大的计算和内存需求。结构化剪枝通过移除整个网络组件提供了可行方案,但现有方法存在性能下降、依赖启发式指标或微调成本高等问题。针对这些挑战,我们提出SPAP(基于交替优化与惩罚方法的结构化剪枝)——一种基于优化理论的新型高效LLM结构化剪枝框架。SPAP通过混合整数优化模型构建剪枝问题,采用能有效最小化剪枝误差的惩罚方法进行剪枝决策,并针对可分裂问题结构设计了交替最小化算法以实现高效的权重更新与性能恢复。在OPT、LLaMA-3/3.1/3.2及Qwen2.5模型上的大量实验表明,SPAP优于当前最先进方法,可实现线性推理加速(30%稀疏度时达1.29倍)与内存占用的成比例降低。本研究为保持模型性能的LLM剪枝提供了实用且优化驱动的解决方案。


Automatic Calibration for Membership Inference Attack on Large Language Models

Abstract

arXiv:2505.03392v1 Announce Type: cross Abstract: Membership Inference Attacks (MIAs) have recently been employed to determine whether a specific text was part of the pre-training data of Large Language Models (LLMs). However, existing methods often misinfer non-members as members, leading to a high false positive rate, or depend on additional reference models for probability calibration, which limits their practicality. To overcome these challenges, we introduce a novel framework called Automatic Calibration Membership Inference Attack (ACMIA), which utilizes a tunable temperature to calibrate output probabilities effectively. This approach is inspired by our theoretical insights into maximum likelihood estimation during the pre-training of LLMs. We introduce ACMIA in three configurations designed to accommodate different levels of model access and increase the probability gap between members and non-members, improving the reliability and robustness of membership inference. Extensive experiments on various open-source LLMs demonstrate that our proposed attack is highly effective, robust, and generalizable, surpassing state-of-the-art baselines across three widely used benchmarks. Our code is available at: \href{https://github.com/Salehzz/ACMIA&rbrace;&lbrace;\textcolor&lbrace;blue&rbrace;&lbrace;Github}}.

摘要

成员推理攻击(MIA)最近被用于判断特定文本是否属于大型语言模型(LLM)的预训练数据。然而,现有方法常将非成员错误推断为成员,导致高误报率,或依赖额外的参考模型进行概率校准,限制了其实用性。为克服这些挑战,我们提出了一种名为自动校准成员推理攻击(ACMIA)的新框架,该框架利用可调温度参数有效校准输出概率。这一方法的灵感来源于我们对LLM预训练过程中最大似然估计的理论洞察。我们以三种配置形式引入ACMIA,旨在适应不同的模型访问权限,并通过扩大成员与非成员之间的概率差距,提升成员推理的可靠性与鲁棒性。在多种开源LLM上的大量实验表明,所提出的攻击方法具有高效性、鲁棒性和泛化能力,在三个广泛使用的基准测试中均超越了现有最优基线。代码已发布于:\href{https://github.com/Salehzz/ACMIA&rbrace;&lbrace;\textcolor&lbrace;blue&rbrace;&lbrace;Github&rbrace;&rbrace;。


Lightweight Clinical Decision Support System using QLoRA-Fine-Tuned LLMs and Retrieval-Augmented Generation

Abstract

arXiv:2505.03406v1 Announce Type: cross Abstract: This research paper investigates the application of Large Language Models (LLMs) in healthcare, specifically focusing on enhancing medical decision support through Retrieval-Augmented Generation (RAG) integrated with hospital-specific data and fine-tuning using Quantized Low-Rank Adaptation (QLoRA). The system utilizes Llama 3.2-3B-Instruct as its foundation model. By embedding and retrieving context-relevant healthcare information, the system significantly improves response accuracy. QLoRA facilitates notable parameter efficiency and memory optimization, preserving the integrity of medical information through specialized quantization techniques. Our research also shows that our model performs relatively well on various medical benchmarks, indicating that it can be used to make basic medical suggestions. This paper details the system's technical components, including its architecture, quantization methods, and key healthcare applications such as enhanced disease prediction from patient symptoms and medical history, treatment suggestions, and efficient summarization of complex medical reports. We touch on the ethical considerations-patient privacy, data security, and the need for rigorous clinical validation-as well as the practical challenges of integrating such systems into real-world healthcare workflows. Furthermore, the lightweight quantized weights ensure scalability and ease of deployment even in low-resource hospital environments. Finally, the paper concludes with an analysis of the broader impact of LLMs on healthcare and outlines future directions for LLMs in medical settings.

摘要

本研究探讨了大型语言模型(LLMs)在医疗领域的应用,重点研究如何通过检索增强生成技术(RAG)整合医院特定数据,并采用量化低秩自适应方法(QLoRA)进行微调,从而提升医疗决策支持能力。该系统以Llama 3.2-3B-Instruct为基础模型,通过嵌入和检索上下文相关的医疗信息,显著提高了响应准确性。QLoRA通过专门的量化技术实现了显著的参数效率和内存优化,同时保持了医疗信息的完整性。研究表明,该模型在多项医疗基准测试中表现良好,能够用于提供基础医疗建议。本文详细阐述了系统的技术组件,包括架构设计、量化方法,以及关键医疗应用场景,如基于患者症状和病史的疾病预测增强、治疗方案建议和复杂医疗报告的高效摘要。同时探讨了伦理考量(患者隐私、数据安全及严格临床验证的必要性)以及将此类系统整合到实际医疗工作流程中的实践挑战。轻量化的量化权重设计确保了系统在资源有限医院环境中的可扩展性和易部署性。最后,本文分析了LLMs对医疗健康的广泛影响,并展望了其在医疗场景中的未来发展方向。


MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks

Abstract

arXiv:2505.03427v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated significant promise for various applications in healthcare. However, their efficacy in the Arabic medical domain remains unexplored due to the lack of high-quality domain-specific datasets and benchmarks. This study introduces MedArabiQ, a novel benchmark dataset consisting of seven Arabic medical tasks, covering multiple specialties and including multiple choice questions, fill-in-the-blank, and patient-doctor question answering. We first constructed the dataset using past medical exams and publicly available datasets. We then introduced different modifications to evaluate various LLM capabilities, including bias mitigation. We conducted an extensive evaluation with five state-of-the-art open-source and proprietary LLMs, including GPT-4o, Claude 3.5-Sonnet, and Gemini 1.5. Our findings highlight the need for the creation of new high-quality benchmarks that span different languages to ensure fair deployment and scalability of LLMs in healthcare. By establishing this benchmark and releasing the dataset, we provide a foundation for future research aimed at evaluating and enhancing the multilingual capabilities of LLMs for the equitable use of generative AI in healthcare.

摘要

大型语言模型(LLMs)在医疗健康领域的多种应用中展现出显著潜力。然而,由于缺乏高质量的领域特定数据集和基准测试,其在阿拉伯语医疗领域的效能尚未得到充分探索。本研究提出MedArabiQ——一个包含七项阿拉伯语医疗任务的新型基准数据集,涵盖多个专科领域,题型包括多选题、填空题及医患问答。我们首先通过历年医学考试和公开数据集构建该数据集,随后引入多种修改方案以评估LLMs的各项能力(包括偏见缓解)。我们对五种最先进的开源和商业LLMs(包括GPT-4o、Claude 3.5-Sonnet和Gemini 1.5)进行了全面评估。研究结果强调需要创建跨语言的新高质量基准,以确保LLMs在医疗领域的公平部署和可扩展性。通过建立该基准并公开数据集,我们为未来研究奠定了基础,旨在评估和增强LLMs的多语言能力,促进生成式人工智能在医疗健康领域的公平应用。


Augmenting Human Cognition through Everyday AR

Abstract

arXiv:2505.03492v1 Announce Type: cross Abstract: As spatial computing and multimodal LLMs mature, AR is tending to become an intuitive "thinking tool," embedding semantic and context-aware intelligence directly into everyday environments. This paper explores how always-on AR can seamlessly bridge digital cognition and physical affordances, enabling proactive, context-sensitive interactions that enhance human task performance and understanding.

摘要

随着空间计算和多模态大语言模型的成熟,增强现实(AR)正逐渐演变为一种直观的"思维工具",将语义理解和情境感知智能直接嵌入日常环境。本文探讨了常启型AR如何无缝衔接数字认知与物理可供性,通过主动且情境敏感的交互方式,提升人类任务执行效能与环境理解能力。


LlamaFirewall: An open source guardrail system for building secure AI agents

Abstract

arXiv:2505.03574v1 Announce Type: cross Abstract: Large language models (LLMs) have evolved from simple chatbots into autonomous agents capable of performing complex tasks such as editing production code, orchestrating workflows, and taking higher-stakes actions based on untrusted inputs like webpages and emails. These capabilities introduce new security risks that existing security measures, such as model fine-tuning or chatbot-focused guardrails, do not fully address. Given the higher stakes and the absence of deterministic solutions to mitigate these risks, there is a critical need for a real-time guardrail monitor to serve as a final layer of defense, and support system level, use case specific safety policy definition and enforcement. We introduce LlamaFirewall, an open-source security focused guardrail framework designed to serve as a final layer of defense against security risks associated with AI Agents. Our framework mitigates risks such as prompt injection, agent misalignment, and insecure code risks through three powerful guardrails: PromptGuard 2, a universal jailbreak detector that demonstrates clear state of the art performance; Agent Alignment Checks, a chain-of-thought auditor that inspects agent reasoning for prompt injection and goal misalignment, which, while still experimental, shows stronger efficacy at preventing indirect injections in general scenarios than previously proposed approaches; and CodeShield, an online static analysis engine that is both fast and extensible, aimed at preventing the generation of insecure or dangerous code by coding agents. Additionally, we include easy-to-use customizable scanners that make it possible for any developer who can write a regular expression or an LLM prompt to quickly update an agent's security guardrails.

摘要

大语言模型(LLMs)已从简单的聊天机器人发展为能够执行复杂任务的自主智能体,例如编辑生产代码、协调工作流程,以及基于网页和电子邮件等不可信输入采取更高风险的操作。这些能力引入了新的安全风险,而现有的安全措施(如模型微调或针对聊天机器人的防护栏)无法完全应对。鉴于风险更高且缺乏确定性解决方案来缓解这些风险,亟需一种实时防护栏监控机制作为最终防御层,并支持系统级、针对特定用例的安全策略定义与执行。我们提出了LlamaFirewall,这是一个专注于安全的开源防护栏框架,旨在作为抵御AI智能体相关安全风险的最终防御层。该框架通过三重强大防护机制降低风险:PromptGuard 2——一种通用越狱检测器,展现出显著的业界领先性能;Agent Alignment Checks——一种思维链审查器,可检测智能体推理过程中的提示注入与目标偏离问题(虽然仍处于实验阶段,但在通用场景中间接注入防护效果优于现有方法);以及CodeShield——一个快速且可扩展的在线静态分析引擎,用于阻止编码智能体生成不安全或危险代码。此外,我们还提供易于定制的扫描器,使任何能编写正则表达式或LLM提示的开发者都能快速更新智能体的安全防护栏。


ReGraP-LLaVA: Reasoning enabled Graph-based Personalized Large Language and Vision Assistant

Abstract

arXiv:2505.03654v1 Announce Type: cross Abstract: Recent advances in personalized MLLMs enable effective capture of user-specific concepts, supporting both recognition of personalized concepts and contextual captioning. However, humans typically explore and reason over relations among objects and individuals, transcending surface-level information to achieve more personalized and contextual understanding. To this end, existing methods may face three main limitations: Their training data lacks multi-object sets in which relations among objects are learnable. Building on the limited training data, their models overlook the relations between different personalized concepts and fail to reason over them. Their experiments mainly focus on a single personalized concept, where evaluations are limited to recognition and captioning tasks. To address the limitations, we present a new dataset named ReGraP, consisting of 120 sets of personalized knowledge. Each set includes images, KGs, and CoT QA pairs derived from the KGs, enabling more structured and sophisticated reasoning pathways. We propose ReGraP-LLaVA, an MLLM trained with the corresponding KGs and CoT QA pairs, where soft and hard graph prompting methods are designed to align KGs within the model's semantic space. We establish the ReGraP Benchmark, which contains diverse task types: multiple-choice, fill-in-the-blank, True/False, and descriptive questions in both open- and closed-ended settings. The proposed benchmark is designed to evaluate the relational reasoning and knowledge-connection capability of personalized MLLMs. We conduct experiments on the proposed ReGraP-LLaVA and other competitive MLLMs. Results show that the proposed model not only learns personalized knowledge but also performs relational reasoning in responses, achieving the SoTA performance compared with the competitive methods. All the codes and datasets are released at: https://github.com/xyfyyds/ReGraP.

摘要

个性化多模态大语言模型(MLLMs)的最新进展能够有效捕捉用户特定概念,同时支持个性化概念识别和上下文描述。然而,人类通常会对物体与个体间的关系进行探索和推理,超越表层信息以实现更具个性化和情境化的理解。现有方法可能面临三个主要局限:其训练数据缺乏可学习物体间关系的多对象集合;在有限训练数据基础上,模型忽视了不同个性化概念间的关系且无法进行推理;实验主要集中于单一个性化概念,评估仅局限于识别和描述任务。

为此,我们提出名为ReGraP的新数据集,包含120组个性化知识集合。每组包含图像、知识图谱(KGs)以及基于KGs生成的思维链问答对(CoT QA),支持更结构化、更复杂的推理路径。我们提出ReGraP-LLaVA模型,通过相应KGs和CoT QA对进行训练,并设计软硬两种图提示方法以对齐模型语义空间内的知识图谱。建立ReGraP基准测试,涵盖多种任务类型:开放式与封闭式设置下的多选题、填空题、判断题及描述性问题。该基准旨在评估个性化MLLMs的关系推理与知识联结能力。

我们对ReGraP-LLaVA及其他竞争性MLLMs进行实验。结果表明,所提模型不仅能学习个性化知识,还能在响应中执行关系推理,相较竞争方法达到最先进性能。所有代码与数据集发布于:https://github.com/xyfyyds/ReGraP。


VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model

Abstract

arXiv:2505.03739v1 Announce Type: cross Abstract: With the growing requirement for natural human-computer interaction, speech-based systems receive increasing attention as speech is one of the most common forms of daily communication. However, the existing speech models still experience high latency when generating the first audio token during streaming, which poses a significant bottleneck for deployment. To address this issue, we propose VITA-Audio, an end-to-end large speech model with fast audio-text token generation. Specifically, we introduce a lightweight Multiple Cross-modal Token Prediction (MCTP) module that efficiently generates multiple audio tokens within a single model forward pass, which not only accelerates the inference but also significantly reduces the latency for generating the first audio in streaming scenarios. In addition, a four-stage progressive training strategy is explored to achieve model acceleration with minimal loss of speech quality. To our knowledge, VITA-Audio is the first multi-modal large language model capable of generating audio output during the first forward pass, enabling real-time conversational capabilities with minimal latency. VITA-Audio is fully reproducible and is trained on open-source data only. Experimental results demonstrate that our model achieves an inference speedup of 3~5x at the 7B parameter scale, but also significantly outperforms open-source models of similar model size on multiple benchmarks for automatic speech recognition (ASR), text-to-speech (TTS), and spoken question answering (SQA) tasks.

摘要

随着对自然人机交互需求的增长,基于语音的系统因其作为日常最常见交流形式之一而受到越来越多的关注。然而现有语音模型在流式场景下生成首个音频令牌时仍存在较高延迟,这成为部署过程中的显著瓶颈。为解决这一问题,我们提出VITA-Audio——一种具有快速音频-文本令牌生成能力的端到端大型语音模型。具体而言,我们引入轻量级多跨模态令牌预测(MCTP)模块,可在单次模型前向传播中高效生成多个音频令牌,不仅加速推理过程,更显著降低流式场景下首音频生成的延迟。此外,我们探索了四阶段渐进式训练策略,在语音质量损失最小化的前提下实现模型加速。据我们所知,VITA-Audio是首个能在首次前向传播时生成音频输出的多模态大语言模型,可实现毫秒级延迟的实时对话能力。该模型完全可复现且仅使用开源数据训练。实验结果表明,在70亿参数规模下,我们的模型不仅实现了3~5倍的推理加速,还在自动语音识别(ASR)、文本转语音(TTS)和语音问答(SQA)任务的多个基准测试中显著优于同规模开源模型。


Language Models Trained to do Arithmetic Predict Human Risky and Intertemporal Choice

Abstract

arXiv:2405.19313v2 Announce Type: replace Abstract: The observed similarities in the behavior of humans and Large Language Models (LLMs) have prompted researchers to consider the potential of using LLMs as models of human cognition. However, several significant challenges must be addressed before LLMs can be legitimately regarded as cognitive models. For instance, LLMs are trained on far more data than humans typically encounter, and may have been directly trained on human data in specific cognitive tasks or aligned with human preferences. Consequently, the origins of these behavioral similarities are not well understood. In this paper, we propose a novel way to enhance the utility of LLMs as cognitive models. This approach involves (i) leveraging computationally equivalent tasks that both an LLM and a rational agent need to master for solving a cognitive problem and (ii) examining the specific task distributions required for an LLM to exhibit human-like behaviors. We apply this approach to decision-making -- specifically risky and intertemporal choice -- where the key computationally equivalent task is the arithmetic of expected value calculations. We show that an LLM pretrained on an ecologically valid arithmetic dataset, which we call Arithmetic-GPT, predicts human behavior better than many traditional cognitive models. Pretraining LLMs on ecologically valid arithmetic datasets is sufficient to produce a strong correspondence between these models and human decision-making. Our results also suggest that LLMs used as cognitive models should be carefully investigated via ablation studies of the pretraining data.

摘要

人类与大型语言模型(LLM)在行为上的相似性促使研究者开始探讨将LLM作为人类认知模型的潜力。然而,在合法地将LLM视为认知模型之前,仍需解决若干关键挑战。例如,LLM的训练数据量远超人类通常接触的范围,且可能直接针对特定认知任务中的人类数据进行了训练或与人类偏好对齐,导致这些行为相似性的根源尚未明晰。本文提出一种新方法来提升LLM作为认知模型的实用性:(i)利用LLM与理性主体在解决认知问题时均需掌握的"计算等效任务";(ii)分析LLM展现类人行为所需的具体任务分布。我们将此方法应用于决策研究(特别是风险决策与跨期选择),其关键计算等效任务为期望值计算的算术运算。研究表明,在生态效度良好的算术数据集(Arithmetic-GPT)上预训练的LLM,其人类行为预测能力优于许多传统认知模型。这种预训练足以使模型与人类决策行为产生高度对应性。结果同时提示,当LLM作为认知模型时,应通过预训练数据的消融研究进行严谨验证。


Malleus: Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization

Abstract

arXiv:2410.13333v3 Announce Type: replace Abstract: As the scale of models and training data continues to grow, there is an expanding reliance on more GPUs to train large-scale models, which inevitably increases the likelihood of encountering dynamic stragglers that some devices lag behind in performance occasionally. However, hybrid parallel training, one of the de facto paradigms to train large models, is typically sensitive to the stragglers. This paper presents Malleus, a straggler-resilient hybrid parallel training framework for large-scale models. Malleus quantifies the stragglers at the nuanced, per-GPU granularity during training, and develops a novel planning algorithm to deduce the optimal parallelization of GPU devices, pipeline stages, model layers, and training data, maximizing training efficiency when stragglers exist. In addition, once a shift in the straggler situation is detected, Malleus adaptively adjusts the parallelization via a re-planning process, and seamlessly and efficiently migrates the model states on the fly, without sacrificing the stability of the training tasks. Empirical results on large language models with up to 110B parameters show that Malleus consistently outperforms existing parallel training frameworks under various straggler situations, delivering on average 2.63-5.28 times of efficiency improvement.

摘要

随着模型规模和训练数据量持续增长,大规模模型训练对GPU数量的需求不断扩大,这不可避免地增加了遭遇动态落后者(部分设备偶尔出现性能滞后的现象)的可能性。然而作为训练大模型的事实范式之一,混合并行训练通常对落后者现象极为敏感。本文提出Malleus——一种面向大规模模型的抗落后混合并行训练框架。Malleus在训练过程中以细粒度的单GPU为单位量化落后者影响,并开发了新型规划算法来推导GPU设备、流水线阶段、模型层与训练数据的最优并行方案,从而在存在落后者时最大化训练效率。此外,当检测到落后者情况变化时,Malleus通过重新规划过程自适应调整并行策略,并能在不牺牲训练任务稳定性的前提下,高效无缝地实时迁移模型状态。在参数规模高达1100亿的大语言模型上的实验表明,Malleus在不同落后者场景下始终优于现有并行训练框架,平均可实现2.63-5.28倍的效率提升。


FastRM: An efficient and automatic explainability framework for multimodal generative models

Abstract

arXiv:2412.01487v4 Announce Type: replace Abstract: Large Vision Language Models (LVLMs) have demonstrated remarkable reasoning capabilities over textual and visual inputs. However, these models remain prone to generating misinformation. Identifying and mitigating ungrounded responses is crucial for developing trustworthy AI. Traditional explainability methods such as gradient-based relevancy maps, offer insight into the decision process of models, but are often computationally expensive and unsuitable for real-time output validation. In this work, we introduce FastRM, an efficient method for predicting explainable Relevancy Maps of LVLMs. Furthermore, FastRM provides both quantitative and qualitative assessment of model confidence. Experimental results demonstrate that FastRM achieves a 99.8% reduction in computation time and a 44.4% reduction in memory footprint compared to traditional relevancy map generation. FastRM allows explainable AI to be more practical and scalable, thereby promoting its deployment in real-world applications and enabling users to more effectively evaluate the reliability of model outputs.

摘要

大型视觉语言模型(LVLMs)在文本和视觉输入上展现出卓越的推理能力,但这些模型仍易生成错误信息。识别并减少无依据的响应对于开发可信赖的人工智能至关重要。传统的可解释性方法(如基于梯度的相关性图)虽能揭示模型的决策过程,但通常计算成本高昂且不适用于实时输出验证。本研究提出FastRM,一种高效预测LVLMs可解释相关性图的方法。此外,FastRM还能对模型置信度进行定量与定性评估。实验结果表明,与传统相关性图生成方法相比,FastRM实现了99.8%的计算时间缩减和44.4%的内存占用降低。该方法使可解释人工智能更具实用性和可扩展性,从而推动其在实际应用中的部署,并帮助用户更有效地评估模型输出的可靠性。


Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization

Abstract

arXiv:2412.17739v3 Announce Type: replace Abstract: Extending the context length of Language Models (LMs) by improving Rotary Position Embedding (RoPE) has become a trend. While existing works mainly address RoPE's limitations within attention mechanism, this paper provides an analysis across nearly all parts of LMs, uncovering their adverse effects on length generalization for RoPE-based attention. Using Discrete Signal Processing theory, we show that RoPE enables periodic attention by implicitly achieving Non-Uniform Discrete Fourier Transform. However, this periodicity is undermined by the spectral damage caused by: 1) linear layers and activation functions outside of attention; 2) insufficiently trained frequency components brought by time-domain truncation. Building on our observations, we propose Fourier Position Embedding (FoPE), which enhances attention's frequency-domain properties to improve both its periodic extension and length generalization. FoPE constructs Fourier Series and zero-outs the destructive frequency components, increasing model robustness against the spectrum damage. Experiments across various model scales and benchmarks show that, within varying context windows, FoPE maintains a more stable performance compared to RoPE and ALiBi. Several analyses and ablations bring further support to our method and theoretical modeling.

摘要

通过改进旋转位置嵌入(RoPE)来扩展语言模型(LM)的上下文长度已成为趋势。现有研究主要关注RoPE在注意力机制内的局限性,而本文则对LM几乎所有组成部分进行了分析,揭示了它们对基于RoPE的注意力长度泛化的不利影响。利用离散信号处理理论,我们证明RoPE通过隐式实现非均匀离散傅里叶变换,实现了周期性注意力。然而,这种周期性会因以下因素导致的频谱损伤而削弱:1)注意力机制外的线性层和激活函数;2)时域截断带来的训练不足的频率成分。基于这些观察,我们提出傅里叶位置嵌入(FoPE),通过增强注意力在频域的特性来改善其周期性扩展和长度泛化能力。FoPE构建傅里叶级数并零值化破坏性频率成分,从而提升模型对频谱损伤的鲁棒性。在不同模型规模和基准测试上的实验表明,在变化的上下文窗口内,FoPE相比RoPE和ALiBi能保持更稳定的性能。多项分析与消融实验进一步验证了我们的方法和理论建模。


AutoDroid-V2: Boosting SLM-based GUI Agents via Code Generation

Abstract

arXiv:2412.18116v3 Announce Type: replace Abstract: Large language models (LLMs) have brought exciting new advances to mobile UI agents, a long-standing research field that aims to complete arbitrary natural language tasks through mobile UI interactions. However, existing UI agents usually demand powerful large language models that are difficult to be deployed locally on end-users' devices, raising huge concerns about user privacy and centralized serving cost. Inspired by the remarkable coding abilities of recent small language models (SLMs), we propose to convert the UI task automation problem to a code generation problem, which can be effectively solved by an on-device SLM and efficiently executed with an on-device code interpreter. Unlike normal coding tasks that can be extensively pre-trained with public datasets, generating UI automation code is challenging due to the diversity, complexity, and variability of target apps. Therefore, we adopt a document-centered approach that automatically builds fine-grained API documentation for each app and generates diverse task samples based on this documentation. By guiding the agent with the synthetic documents and task samples, it learns to generate precise and efficient scripts to complete unseen tasks. Based on detailed comparisons with state-of-the-art mobile UI agents, our approach effectively improves the mobile task automation with significantly higher success rates and lower latency/token consumption. Code is open-sourced at https://github.com/MobileLLM/AutoDroid-V2.

摘要

大型语言模型(LLMs)为移动用户界面代理带来了令人振奋的新进展,这一长期研究领域旨在通过移动界面交互完成任意自然语言任务。然而,现有界面代理通常需要部署在终端用户设备上难度较高的大模型,引发了用户隐私和集中式服务成本的重大担忧。受近期小型语言模型(SLMs)卓越编码能力的启发,我们提出将界面任务自动化问题转化为代码生成问题,该问题可通过设备端小型语言模型有效解决,并借助设备端代码解释器高效执行。与可利用公开数据集广泛预训练的常规编码任务不同,由于目标应用的多样性、复杂性和多变性,生成界面自动化代码具有挑战性。为此,我们采用以文档为中心的方法,自动为每个应用构建细粒度的API文档,并基于该文档生成多样化任务样本。通过使用合成文档和任务样本引导代理,其可学习生成精确高效的脚本来完成未见任务。与最先进的移动界面代理进行详细对比表明,我们的方法显著提高了移动任务自动化成功率,同时大幅降低了延迟/令牌消耗。代码已开源:https://github.com/MobileLLM/AutoDroid-V2。


Synergizing Large Language Models and Task-specific Models for Time Series Anomaly Detection

Abstract

arXiv:2501.05675v4 Announce Type: replace Abstract: In anomaly detection, methods based on large language models (LLMs) can incorporate expert knowledge by reading professional document, while task-specific small models excel at extracting normal data patterns and detecting value fluctuations from training data of target applications. Inspired by the human nervous system, where the brain stores expert knowledge and the peripheral nervous system and spinal cord handle specific tasks like withdrawal and knee-jerk reflexes, we propose CoLLaTe, a framework designed to facilitate collaboration between LLMs and task-specific models, leveraging the strengths of both models for anomaly detection. In particular, we first formulate the collaboration process and identify two key challenges in the collaboration: (1) the misalignment between the expression domains of the LLMs and task-specific small models, and (2) error accumulation arising from the predictions of both models. To address these challenges, we then introduce two key components in CoLLaTe: a model alignment module and a collaborative loss function. Through theoretical analysis and experimental validation, we demonstrate that these components effectively mitigate the identified challenges and achieve better performance than both LLM-based and task-specific models.

摘要

在异常检测领域,基于大语言模型(LLMs)的方法能够通过阅读专业文献融入专家知识,而针对特定任务的小型模型则擅长从目标应用的训练数据中提取正常数据模式并检测数值波动。受人类神经系统的启发——大脑存储专家知识,外周神经系统和脊髓处理诸如退缩反射和膝跳反射等特定任务——我们提出了CoLLaTe框架,旨在促进大语言模型与任务专用模型之间的协作,充分发挥两种模型在异常检测中的优势。

具体而言,我们首先形式化了协作过程,并识别出协作中的两个关键挑战:(1)大语言模型与任务专用小型模型在表达域上的不匹配;(2)两种模型预测结果导致的误差累积。针对这些挑战,我们在CoLLaTe中引入了两个核心组件:模型对齐模块和协作损失函数。通过理论分析和实验验证,我们证明这些组件能有效缓解上述挑战,并取得优于单纯基于大语言模型或任务专用模型的性能表现。


Cooperative Multi-Agent Planning with Adaptive Skill Synthesis

Abstract

arXiv:2502.10148v2 Announce Type: replace Abstract: Despite much progress in training distributed artificial intelligence (AI), building cooperative multi-agent systems with multi-agent reinforcement learning (MARL) faces challenges in sample efficiency, interpretability, and transferability. Unlike traditional learning-based methods that require extensive interaction with the environment, large language models (LLMs) demonstrate remarkable capabilities in zero-shot planning and complex reasoning. However, existing LLM-based approaches heavily rely on text-based observations and struggle with the non-Markovian nature of multi-agent interactions under partial observability. We present COMPASS, a novel multi-agent architecture that integrates vision-language models (VLMs) with a dynamic skill library and structured communication for decentralized closed-loop decision-making. The skill library, bootstrapped from demonstrations, evolves via planner-guided tasks to enable adaptive strategies. COMPASS propagates entity information through multi-hop communication under partial observability. Evaluations on the improved StarCraft Multi-Agent Challenge (SMACv2) demonstrate COMPASS's strong performance against state-of-the-art MARL baselines across both symmetric and asymmetric scenarios. Notably, in the symmetric Protoss 5v5 task, COMPASS achieved a 57% win rate, representing a 30 percentage point advantage over QMIX (27%). Project page can be found at https://stellar-entremet-1720bb.netlify.app/.

摘要

尽管分布式人工智能(AI)训练已取得显著进展,但基于多智能体强化学习(MARL)构建协作式多智能体系统仍面临样本效率、可解释性和可迁移性等挑战。传统基于学习的方法需要与环境进行大量交互,而大语言模型(LLM)在零样本规划和复杂推理方面展现出卓越能力。然而现有基于LLM的方法严重依赖文本观测,且难以处理部分可观测条件下多智能体交互的非马尔可夫特性。我们提出COMPASS——一种集成视觉语言模型(VLM)、动态技能库与结构化通信的新型多智能体架构,可实现去中心化闭环决策。该技能库通过演示样本初始化,并借助规划器引导的任务进行动态演化以实现自适应策略。COMPASS能在部分可观测条件下通过多跳通信传递实体信息。在改进版《星际争霸》多智能体挑战(SMACv2)上的评估表明,COMPASS在对称与非对称场景中均优于最先进的MARL基线方法。值得注意的是,在对称场景Protoss 5v5任务中,COMPASS以57%的胜率显著超越QMIX(27%)30个百分点。项目页面详见https://stellar-entremet-1720bb.netlify.app/。


Co-NavGPT: Multi-Robot Cooperative Visual Semantic Navigation Using Vision Language Models

Abstract

arXiv:2310.07937v3 Announce Type: replace-cross Abstract: Visual target navigation is a critical capability for autonomous robots operating in unknown environments, particularly in human-robot interaction scenarios. While classical and learning-based methods have shown promise, most existing approaches lack common-sense reasoning and are typically designed for single-robot settings, leading to reduced efficiency and robustness in complex environments. To address these limitations, we introduce Co-NavGPT, a novel framework that integrates a Vision Language Model (VLM) as a global planner to enable common-sense multi-robot visual target navigation. Co-NavGPT aggregates sub-maps from multiple robots with diverse viewpoints into a unified global map, encoding robot states and frontier regions. The VLM uses this information to assign frontiers across the robots, facilitating coordinated and efficient exploration. Experiments on the Habitat-Matterport 3D (HM3D) demonstrate that Co-NavGPT outperforms existing baselines in terms of success rate and navigation efficiency, without requiring task-specific training. Ablation studies further confirm the importance of semantic priors from the VLM. We also validate the framework in real-world scenarios using quadrupedal robots. Supplementary video and code are available at: https://sites.google.com/view/co-navgpt2.

摘要

视觉目标导航是自主机器人在未知环境中运行的关键能力,尤其在人类-机器人交互场景中。尽管基于经典方法和学习的方法已展现出潜力,但现有方案大多缺乏常识推理能力,且通常针对单机器人场景设计,导致在复杂环境中效率与鲁棒性降低。为解决这些局限,我们提出Co-NavGPT——一种集成视觉语言模型(VLM)作为全局规划器的新型框架,可实现具备常识推理的多机器人视觉目标导航。该框架将多台具有异构视角的机器人子地图聚合为统一全局地图,并编码机器人状态与边界区域。视觉语言模型利用这些信息为各机器人分配探索边界,实现协同高效探索。在Habitat-Matterport 3D(HM3D)数据集上的实验表明,Co-NavGPT在成功率和导航效率方面均优于现有基线方法,且无需任务特定训练。消融实验进一步验证了视觉语言模型提供的语义先验的重要性。我们还通过四足机器人平台在真实场景中验证了该框架的有效性。补充视频与代码详见:https://sites.google.com/view/co-navgpt2。


Incoherent Probability Judgments in Large Language Models

Abstract

arXiv:2401.16646v2 Announce Type: replace-cross Abstract: Autoregressive Large Language Models (LLMs) trained for next-word prediction have demonstrated remarkable proficiency at producing coherent text. But are they equally adept at forming coherent probability judgments? We use probabilistic identities and repeated judgments to assess the coherence of probability judgments made by LLMs. Our results show that the judgments produced by these models are often incoherent, displaying human-like systematic deviations from the rules of probability theory. Moreover, when prompted to judge the same event, the mean-variance relationship of probability judgments produced by LLMs shows an inverted-U-shaped like that seen in humans. We propose that these deviations from rationality can be explained by linking autoregressive LLMs to implicit Bayesian inference and drawing parallels with the Bayesian Sampler model of human probability judgments.

摘要

基于自回归架构、以下一词预测为目标训练的大语言模型(LLMs)已展现出生成连贯文本的卓越能力。但它们能否同样形成连贯的概率判断?我们通过概率恒等式和重复判断任务评估LLMs概率判断的连贯性。结果表明,这些模型产生的判断往往缺乏一致性,表现出类似人类的系统性概率规则偏离现象。此外,当要求模型对同一事件进行重复判断时,其概率判断的均值-方差关系呈现与人类相似的倒U型曲线。我们提出,通过将自回归LLMs与隐式贝叶斯推理联系起来,并类比人类概率判断的"贝叶斯采样器"模型,可以解释这些偏离理性准则的现象。


Beyond Bare Queries: Open-Vocabulary Object Grounding with 3D Scene Graph

Abstract

arXiv:2406.07113v4 Announce Type: replace-cross Abstract: Locating objects described in natural language presents a significant challenge for autonomous agents. Existing CLIP-based open-vocabulary methods successfully perform 3D object grounding with simple (bare) queries, but cannot cope with ambiguous descriptions that demand an understanding of object relations. To tackle this problem, we propose a modular approach called BBQ (Beyond Bare Queries), which constructs 3D scene graph representation with metric and semantic spatial edges and utilizes a large language model as a human-to-agent interface through our deductive scene reasoning algorithm. BBQ employs robust DINO-powered associations to construct 3D object-centric map and an advanced raycasting algorithm with a 2D vision-language model to describe them as graph nodes. On the Replica and ScanNet datasets, we have demonstrated that BBQ takes a leading place in open-vocabulary 3D semantic segmentation compared to other zero-shot methods. Also, we show that leveraging spatial relations is especially effective for scenes containing multiple entities of the same semantic class. On challenging Sr3D+, Nr3D and ScanRefer benchmarks, our deductive approach demonstrates a significant improvement, enabling objects grounding by complex queries compared to other state-of-the-art methods. The combination of our design choices and software implementation has resulted in significant data processing speed in experiments on the robot on-board computer. This promising performance enables the application of our approach in intelligent robotics projects. We made the code publicly available at https://linukc.github.io/BeyondBareQueries/.

摘要

基于自然语言描述定位物体对自主智能体构成重大挑战。现有基于CLIP的开放词汇方法虽能成功处理简单(基础)查询的3D物体定位,但无法应对需要理解物体关系的模糊描述。为解决该问题,我们提出模块化方法BBQ(超越基础查询),该方法通过构建具有度量与语义空间边的3D场景图表示,并利用大语言模型作为人机交互接口,结合我们提出的演绎式场景推理算法。BBQ采用基于DINO的强健关联构建3D物体中心地图,并通过配备2D视觉语言模型的高级光线投射算法将物体描述为图节点。在Replica和ScanNet数据集上的实验表明,相较于其他零样本方法,BBQ在开放词汇3D语义分割任务中处于领先地位。同时,我们证明利用空间关系对包含同类语义实体的场景特别有效。在Sr3D+、Nr3D和ScanRefer基准测试中,我们的演绎式方法相较其他前沿技术展现出显著优势,能够通过复杂查询实现物体定位。我们的设计选择与软件实现相结合,在机器人车载计算机实验中实现了显著的数据处理速度。这一优异性能使得我们的方法可应用于智能机器人项目。代码已开源:https://linukc.github.io/BeyondBareQueries/。


The Struggles of LLMs in Cross-lingual Code Clone Detection

Abstract

arXiv:2408.04430v3 Announce Type: replace-cross Abstract: With the involvement of multiple programming languages in modern software development, cross-lingual code clone detection has gained traction within the software engineering community. Numerous studies have explored this topic, proposing various promising approaches. Inspired by the significant advances in machine learning in recent years, particularly Large Language Models (LLMs), which have demonstrated their ability to tackle various tasks, this paper revisits cross-lingual code clone detection. We evaluate the performance of five (05) LLMs and eight prompts (08) for the identification of cross-lingual code clones. Additionally, we compare these results against two baseline methods. Finally, we evaluate a pre-trained embedding model to assess the effectiveness of the generated representations for classifying clone and non-clone pairs. The studies involving LLMs and Embedding models are evaluated using two widely used cross-lingual datasets, XLCoST and CodeNet. Our results show that LLMs can achieve high F1 scores, up to 0.99, for straightforward programming examples. However, they not only perform less well on programs associated with complex programming challenges but also do not necessarily understand the meaning of "code clones" in a cross-lingual setting. We show that embedding models used to represent code fragments from different programming languages in the same representation space enable the training of a basic classifier that outperforms all LLMs by ~1 and ~20 percentage points on the XLCoST and CodeNet datasets, respectively. This finding suggests that, despite the apparent capabilities of LLMs, embeddings provided by embedding models offer suitable representations to achieve state-of-the-art performance in cross-lingual code clone detection.

摘要

随着现代软件开发中多编程语言的参与,跨语言代码克隆检测在软件工程领域受到广泛关注。已有大量研究探索该主题,并提出了多种有效方法。近年来机器学习尤其是大语言模型(LLMs)取得的重大进展,展示了其处理各类任务的能力,受此启发,本文重新审视跨语言代码克隆检测问题。我们评估了五种(05)大语言模型和八种提示(08)在识别跨语言代码克隆方面的性能,同时将结果与两种基线方法进行对比。最后,我们评估了一个预训练嵌入模型,以检验生成表征在克隆对与非克隆对分类中的有效性。涉及大语言模型和嵌入模型的研究使用两个广泛采用的跨语言数据集XLCoST和CodeNet进行评估。结果表明:对于简单编程示例,大语言模型可获得高达0.99的F1分数;但在涉及复杂编程挑战的程序上表现欠佳,且未必能真正理解跨语言场景下"代码克隆"的含义。我们发现,嵌入模型通过将不同编程语言的代码片段映射到同一表征空间,使得训练的基础分类器在XLCoST和CodeNet数据集上分别以约1%和20%的优势超越所有大语言模型。这一发现表明,尽管大语言模型具有显著能力,但嵌入模型提供的表征能够实现跨语言代码克隆检测的最优性能。


LLM-3D Print: Large Language Models To Monitor and Control 3D Printing

Abstract

arXiv:2408.14307v2 Announce Type: replace-cross Abstract: Industry 4.0 has revolutionized manufacturing by driving digitalization and shifting the paradigm toward additive manufacturing (AM). Fused Deposition Modeling (FDM), a key AM technology, enables the creation of highly customized, cost-effective products with minimal material waste through layer-by-layer extrusion, posing a significant challenge to traditional subtractive methods. However, the susceptibility of material extrusion techniques to errors often requires expert intervention to detect and mitigate defects that can severely compromise product quality. While automated error detection and machine learning models exist, their generalizability across diverse 3D printer setups, firmware, and sensors is limited, and deep learning methods require extensive labeled datasets, hindering scalability and adaptability. To address these challenges, we present a process monitoring and control framework that leverages pre-trained Large Language Models (LLMs) alongside 3D printers to detect and address printing defects. The LLM evaluates print quality by analyzing images captured after each layer or print segment, identifying failure modes and querying the printer for relevant parameters. It then generates and executes a corrective action plan. We validated the effectiveness of the proposed framework in identifying defects by comparing it against a control group of engineers with diverse AM expertise. Our evaluation demonstrated that LLM-based agents not only accurately identify common 3D printing errors, such as inconsistent extrusion, stringing, warping, and layer adhesion, but also effectively determine the parameters causing these failures and autonomously correct them without any need for human intervention.

摘要

工业4.0通过推动数字化并将制造范式转向增材制造(AM)引发了制造业革命。作为关键增材技术,熔融沉积成型(FDM)通过逐层挤出工艺实现了高度定制化、低成本且材料浪费极少的产品的制造,对传统减材方法构成重大挑战。然而,材料挤出技术易受误差影响的特性常需专家介入以检测和消除可能严重影响产品质量的缺陷。尽管现有自动错误检测和机器学习模型,但其在不同3D打印机配置、固件和传感器间的泛化能力有限,且深度学习方法需要大量标注数据集,制约了可扩展性和适应性。针对这些挑战,我们提出一种结合预训练大语言模型(LLM)与3D打印机的过程监控与调控框架,用于检测和处理打印缺陷。该LLM通过分析每层或打印段完成后捕获的图像评估打印质量,识别故障模式并向打印机查询相关参数,随后生成并执行纠正措施方案。我们通过将其与具有不同增材制造专业背景的工程师对照组进行比较,验证了该框架在缺陷识别方面的有效性。评估结果表明,基于LLM的智能体不仅能准确识别常见3D打印错误(如挤出不均、拉丝、翘曲和层间粘附问题),还能有效判定导致这些故障的参数并自主实施纠正,完全无需人工干预。


Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution

Abstract

arXiv:2410.00153v3 Announce Type: replace-cross Abstract: Probing learned concepts in large language models (LLMs) is crucial for understanding how semantic knowledge is encoded internally. Training linear classifiers on probing tasks is a principle approach to denote the vector of a certain concept in the representation space. However, the single vector identified for a concept varies with both data and training, making it less robust and weakening its effectiveness in real-world applications. To address this challenge, we propose an approach to approximate the subspace representing a specific concept. Built on linear probing classifiers, we extend the concept vectors into Gaussian Concept Subspace (GCS). We demonstrate GCS's effectiveness through measuring its faithfulness and plausibility across multiple LLMs with different sizes and architectures. Additionally, we use representation intervention tasks to showcase its efficacy in real-world applications such as emotion steering. Experimental results indicate that GCS concept vectors have the potential to balance steering performance and maintaining the fluency in natural language generation tasks.

摘要

探究大型语言模型(LLMs)中已学习概念的表征对于理解语义知识如何内部编码至关重要。在探测任务上训练线性分类器是表示空间中特定概念向量的主要方法。然而,针对某一概念识别的单一向量会随数据和训练过程而变化,导致其鲁棒性不足并削弱实际应用效果。为解决这一挑战,我们提出了一种近似表征特定概念的子空间方法。基于线性探测分类器,我们将概念向量扩展为高斯概念子空间(GCS)。通过在不同规模和架构的多个LLM上测量其忠实性与合理性,我们验证了GCS的有效性。此外,我们利用表征干预任务展示了其在情感引导等实际应用中的效能。实验结果表明,GCS概念向量能在自然语言生成任务中平衡引导性能与文本流畅性。


Uncertainty-Guided Self-Questioning and Answering for Video-Language Alignment

Abstract

arXiv:2410.02768v2 Announce Type: replace-cross Abstract: The development of multi-modal models has been rapidly advancing, with some demonstrating remarkable capabilities. However, annotating video-text pairs remains expensive and insufficient. Take video question answering (VideoQA) tasks as an example, human annotated questions and answers often cover only part of the video, since the corresponding text is often short and monotonous, leading to underutilization of video. To address this, we propose a Bootstrapping Video-Language Alignment framework (BoViLA), a self-training method that augments question samples during training process through LLM-based self-questioning and answering, which help model exploit video information and the internal knowledge of LLMs more thoroughly to improve modality alignment. However, low-quality self-generated questions may instead contaminate the performance, especially in the early stages of training, as we have observed in our experiments. To filter bad self-generated questions, we introduce Evidential Deep Learning (EDL) to estimate uncertainty and assess the quality of self-generated questions by evaluating the modality alignment within the context. To the best of our knowledge, this work is the first to explore LLM-based self-training frameworks for modality alignment. We evaluate BoViLA on five strong VideoQA benchmarks, where it outperforms several state-of-the-art methods and demonstrate its effectiveness and generality. Additionally, we provide extensive analyses of the self-training framework and the EDL-based uncertainty filtering mechanism. The code will be made available.

摘要

多模态模型的发展日新月异,部分模型已展现出卓越性能。然而视频-文本对的标注仍存在成本高昂且数据不足的问题。以视频问答(VideoQA)任务为例,人工标注的问答往往仅覆盖视频片段,因其对应文本通常简短单一,导致视频信息未被充分利用。为此,我们提出自举视频-语言对齐框架(BoViLA),这是一种通过基于大语言模型(LLM)的自提问-自回答机制在训练过程中扩充问题样本的自训练方法,促使模型更充分地挖掘视频信息与LLM内部知识以提升模态对齐效果。但实验发现,低质量的自生成问题可能污染模型性能,尤其在训练初期阶段。为过滤劣质自生成问题,我们引入证据深度学习(EDL)来评估不确定性,通过上下文中的模态对齐程度判定自生成问题的质量。据我们所知,本研究是首个探索基于LLM的自训练框架实现模态对齐的工作。我们在五个权威VideoQA基准上评估BoViLA,其性能超越多种前沿方法,验证了框架的有效性与泛化性。此外,我们对自训练框架及基于EDL的不确定性过滤机制进行了全面分析。代码将公开提供。


MAMMAL -- Molecular Aligned Multi-Modal Architecture and Language

Abstract

arXiv:2410.22367v3 Announce Type: replace-cross Abstract: Large language models applied to vast biological datasets have the potential to transform biology by uncovering disease mechanisms and accelerating drug development. However, current models are often siloed, trained separately on small-molecules, proteins, or transcriptomic data, limiting their ability to capture complex, multi-modal interactions. Effective drug discovery requires computational tools that integrate multiple biological entities while supporting prediction and generation, a challenge existing models struggle to address. For this purpose, we present MAMMAL - Molecular Aligned Multi-Modal Architecture and Language - a versatile method applied to create a multi-task foundation model that learns from large-scale biological datasets across diverse modalities, including proteins, small-molecules, and omics. MAMMAL's structured prompt syntax supports classification, regression, and generation tasks while handling token and scalar inputs and outputs. Evaluated on eleven diverse downstream tasks, it reaches a new state of the art (SOTA) in nine tasks and is comparable to SOTA in two tasks, all within a unified architecture, unlike prior task-specific models. Additionally, we explored Alphafold 3 binding prediction capabilities on antibody-antigen and nanobody-antigen complexes showing significantly better classification performance of MAMMAL in 3 out of 4 targets. The model code and pretrained weights are publicly available at https://github.com/BiomedSciAI/biomed-multi-alignment and https://huggingface.co/ibm/biomed.omics.bl.sm.ma-ted-458m

摘要

应用于海量生物数据集的大语言模型具有通过揭示疾病机制和加速药物研发来变革生物学的潜力。然而,当前模型往往孤立运行,仅针对小分子、蛋白质或转录组数据进行单独训练,限制了其捕捉复杂多模态相互作用的能力。有效的药物发现需要能整合多种生物实体并支持预测与生成的计算工具,这是现有模型难以应对的挑战。为此,我们提出MAMMAL(分子对齐多模态架构与语言)——一种通用方法,用于创建能从蛋白质、小分子和组学等多模态大规模生物数据中学习的多任务基础模型。MAMMAL的结构化提示语法支持分类、回归和生成任务,同时处理标记和标量输入输出。在11项多样化下游任务评估中,该模型在9项任务上达到最新技术水平(SOTA),在另外2项任务中与SOTA相当,且所有功能均集成于统一架构,不同于以往针对特定任务的模型。此外,我们探索了Alphafold 3在抗体-抗原和纳米抗体-抗原复合物上的结合预测能力,结果显示MAMMAL在4个靶标中有3个表现出显著更优的分类性能。模型代码与预训练权重已公开发布于https://github.com/BiomedSciAI/biomed-multi-alignmenthttps://huggingface.co/ibm/biomed.omics.bl.sm.ma-ted-458m。


Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?

Abstract

arXiv:2502.07963v3 Announce Type: replace-cross Abstract: Medical research faces well-documented challenges in translating novel treatments into clinical practice. Publishing incentives encourage researchers to present "positive" findings, even when empirical results are equivocal. Consequently, it is well-documented that authors often spin study results, especially in article abstracts. Such spin can influence clinician interpretation of evidence and may affect patient care decisions. In this study, we ask whether the interpretation of trial results offered by Large Language Models (LLMs) is similarly affected by spin. This is important since LLMs are increasingly being used to trawl through and synthesize published medical evidence. We evaluated 22 LLMs and found that they are across the board more susceptible to spin than humans. They might also propagate spin into their outputs: We find evidence, e.g., that LLMs implicitly incorporate spin into plain language summaries that they generate. We also find, however, that LLMs are generally capable of recognizing spin, and can be prompted in a way to mitigate spin's impact on LLM outputs.

摘要

医学研究在将新疗法转化为临床实践过程中面临诸多已获实证的挑战。发表激励机制促使研究者呈现"阳性"结果,即便实证结论模棱两可。大量证据表明,作者常对研究结果进行倾向性表述(spin),尤其在论文摘要部分。这种倾向性表述可能影响临床医生对证据的解读,进而干扰诊疗决策。本研究探讨大型语言模型(LLMs)对试验结果的解读是否同样受倾向性表述影响——鉴于LLMs正被日益用于检索和整合已发表的医学证据,该问题至关重要。我们对22个LLM进行评估,发现所有模型均比人类更易受倾向性表述影响。模型还可能将倾向性传递至输出内容:例如有证据表明,LLMs会将其隐含融入生成的通俗摘要中。但我们也发现,LLMs普遍具备识别倾向性表述的能力,通过特定提示可减轻其对模型输出的影响。


MoM: Linear Sequence Modeling with Mixture-of-Memories

Abstract

arXiv:2502.13685v2 Announce Type: replace-cross Abstract: Linear sequence modeling methods, such as linear attention, state space modeling, and linear RNNs, offer significant efficiency improvements by reducing the complexity of training and inference. However, these methods typically compress the entire input sequence into a single fixed-size memory state, which leads to suboptimal performance on recall-intensive downstream tasks. Drawing inspiration from neuroscience, particularly the brain's ability to maintain robust long-term memory while mitigating "memory interference", we introduce a novel architecture called Mixture-of-Memories (MoM). MoM utilizes multiple independent memory states, with a router network directing input tokens to specific memory states. This approach greatly enhances the overall memory capacity while minimizing memory interference. As a result, MoM performs exceptionally well on recall-intensive tasks, surpassing existing linear sequence modeling techniques. Despite incorporating multiple memory states, the computation of each memory state remains linear in complexity, allowing MoM to retain the linear-complexity advantage during training, while constant-complexity during inference. Our experimental results show that MoM significantly outperforms current linear sequence models on downstream language tasks, particularly recall-intensive tasks, and even achieves performance comparable to Transformer models. The code is released at https://github.com/OpenSparseLLMs/MoM and is also released as a part of https://github.com/OpenSparseLLMs/Linear-MoE.

摘要

线性序列建模方法(如线性注意力、状态空间建模和线性RNN)通过降低训练与推理复杂度显著提升了效率。然而,这些方法通常将整个输入序列压缩为单一固定大小的记忆状态,导致其在回忆密集型下游任务中表现欠佳。受神经科学启发(尤其是大脑在抑制"记忆干扰"的同时保持稳健长期记忆的能力),我们提出了一种称为混合记忆(Mixture-of-Memories, MoM)的新型架构。MoM采用多个独立记忆状态,并通过路由网络将输入令牌定向至特定记忆状态。该方法在最大限度减少记忆干扰的同时,显著提升了整体记忆容量。因此,MoM在回忆密集型任务中表现卓越,超越了现有线性序列建模技术。尽管引入多重记忆状态,每个记忆状态的计算仍保持线性复杂度,使得MoM在训练时保留线性复杂度优势,在推理时保持恒定复杂度。实验结果表明,MoM在下游语言任务(尤其是回忆密集型任务)上显著优于当前线性序列模型,甚至达到与Transformer模型相当的性能。代码发布于https://github.com/OpenSparseLLMs/MoM,并作为https://github.com/OpenSparseLLMs/Linear-MoE的组成部分同步开源。


BRIDGE: Bootstrapping Text to Control Time-Series Generation via Multi-Agent Iterative Optimization and Diffusion Modelling

Abstract

arXiv:2503.02445v3 Announce Type: replace-cross Abstract: Time-series Generation (TSG) is a prominent research area with broad applications in simulations, data augmentation, and counterfactual analysis. While existing methods have shown promise in unconditional single-domain TSG, real-world applications demand for cross-domain approaches capable of controlled generation tailored to domain-specific constraints and instance-level requirements. In this paper, we argue that text can provide semantic insights, domain information and instance-specific temporal patterns, to guide and improve TSG. We introduce ``Text-Controlled TSG'', a task focused on generating realistic time series by incorporating textual descriptions. To address data scarcity in this setting, we propose a novel LLM-based Multi-Agent framework that synthesizes diverse, realistic text-to-TS datasets. Furthermore, we introduce BRIDGE, a hybrid text-controlled TSG framework that integrates semantic prototypes with text description for supporting domain-level guidance. This approach achieves state-of-the-art generation fidelity on 11 of 12 datasets, and improves controllability by 12.52% on MSE and 6.34% MAE compared to no text input generation, highlighting its potential for generating tailored time-series data.

摘要

时间序列生成(TSG)作为仿真模拟、数据增强和反事实分析等领域的重要研究方向,已展现出广阔的应用前景。尽管现有方法在无条件单领域TSG中表现良好,但实际应用需要能够根据领域特定约束和实例级需求进行可控生成的跨领域方法。本文提出,文本可提供语义洞察、领域信息及实例特异性时间模式,从而指导并改进TSG。我们引入"文本控制TSG"这一新任务,其核心是通过整合文本描述来生成真实时间序列。针对该场景下的数据稀缺问题,我们提出一种基于大语言模型的多智能体框架,可合成多样化的真实文本-TS配对数据集。进一步,我们开发了混合框架BRIDGE,通过将语义原型与文本描述相结合来实现领域级引导。该方法在12个数据集中有11个达到最先进的生成保真度,与无文本输入生成相比,控制性能平均提升12.52%(MSE)和6.34%(MAE),凸显了其在定制化时间序列数据生成方面的潜力。


Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models

Abstract

arXiv:2503.06269v2 Announce Type: replace-cross Abstract: Traditional white-box methods for creating adversarial perturbations against LLMs typically rely only on gradient computation from the targeted model, ignoring the internal mechanisms responsible for attack success or failure. Conversely, interpretability studies that analyze these internal mechanisms lack practical applications beyond runtime interventions. We bridge this gap by introducing a novel white-box approach that leverages mechanistic interpretability techniques to craft practical adversarial inputs. Specifically, we first identify acceptance subspaces - sets of feature vectors that do not trigger the model's refusal mechanisms - then use gradient-based optimization to reroute embeddings from refusal subspaces to acceptance subspaces, effectively achieving jailbreaks. This targeted approach significantly reduces computation cost, achieving attack success rates of 80-95% on state-of-the-art models including Gemma2, Llama3.2, and Qwen2.5 within minutes or even seconds, compared to existing techniques that often fail or require hours of computation. We believe this approach opens a new direction for both attack research and defense development. Furthermore, it showcases a practical application of mechanistic interpretability where other methods are less efficient, which highlights its utility. The code and generated datasets are available at https://github.com/Sckathach/subspace-rerouting.

摘要

传统针对大语言模型生成对抗扰动的白盒方法通常仅依赖于目标模型的梯度计算,忽视了决定攻击成败的内部机制。而分析这些内部机制的可解释性研究,除运行时干预外缺乏实际应用场景。本研究通过引入一种结合机理可解释性技术的新型白盒方法,构建了具有实用价值的对抗输入。具体而言,我们首先识别接受子空间——即不会触发模型拒绝机制的特征向量集合,随后基于梯度优化将嵌入向量从拒绝子空间重定向至接受子空间,从而实现高效越狱。这种定向方法显著降低了计算成本,在Gemma2、Llama3.2和Qwen2.5等前沿模型上实现了80-95%的攻击成功率,耗时仅需数分钟甚至秒级,而现有技术往往失败或需数小时计算。我们认为该方法为攻击研究和防御开发开辟了新方向,同时展示了机理可解释性在其他方法效率低下场景中的实用价值。相关代码与生成数据集已发布于https://github.com/Sckathach/subspace-rerouting。


CALLM: Understanding Cancer Survivors' Emotions and Intervention Opportunities via Mobile Diaries and Context-Aware Language Models

Abstract

arXiv:2503.10707v2 Announce Type: replace-cross Abstract: Cancer survivors face unique emotional challenges that impact their quality of life. Mobile diary entries provide a promising method for tracking emotional states, improving self-awareness, and promoting well-being outcome. This paper aims to, through mobile diaries, understand cancer survivors' emotional states and key variables related to just-in-time intervention opportunities, including the desire to regulate emotions and the availability to engage in interventions. Although emotion analysis tools show potential for recognizing emotions from text, current methods lack the contextual understanding necessary to interpret brief mobile diary narratives. Our analysis of diary entries from cancer survivors (N=407) reveals systematic relationships between described contexts and emotional states, with administrative and health-related contexts associated with negative affect and regulation needs, while leisure activities promote positive emotions. We propose CALLM, a Context-Aware framework leveraging Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) to analyze these brief entries by integrating retrieved peer experiences and personal diary history. CALLM demonstrates strong performance with balanced accuracies reaching 72.96% for positive affect, 73.29% for negative affect, 73.72% for emotion regulation desire, and 60.09% for intervention availability, outperforming language model baselines. Post-hoc analysis reveals that model confidence strongly predicts accuracy, with longer diary entries generally enhancing performance, and brief personalization periods yielding meaningful improvements. Our findings demonstrate how contextual information in mobile diaries can be effectively leveraged to understand emotional experiences, predict key states, and identify optimal intervention moments for personalized just-in-time support.

摘要

癌症幸存者面临独特的情感挑战,这些挑战影响着他们的生活质量。移动日记记录为追踪情绪状态、提升自我意识和改善健康结果提供了有效途径。本文旨在通过移动日记理解癌症幸存者的情绪状态,以及与即时干预机会相关的关键变量,包括情绪调节意愿和参与干预的可用性。尽管情感分析工具在从文本识别情绪方面展现出潜力,但现有方法缺乏解读简短移动日记叙述所需的上下文理解能力。我们对407名癌症幸存者的日记分析表明,描述情境与情绪状态存在系统性关联:行政事务和健康相关情境与负面情绪及调节需求相关,而休闲活动则促进积极情绪。我们提出CALLM框架,该上下文感知系统通过结合检索增强生成技术和大语言模型,整合同类经历和个人日记历史来分析简短记录。CALLM展现出优异性能,在积极情绪(72.96%)、消极情绪(73.29%)、情绪调节意愿(73.72%)和干预可用性(60.09%)的平衡准确率上均超越基线语言模型。事后分析显示模型置信度能有效预测准确性,较长日记通常提升性能,短暂个性化周期即可带来显著改进。本研究证明了如何有效利用移动日记中的情境信息来理解情感体验、预测关键状态,并为个性化即时支持确定最佳干预时机。


Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models

Abstract

arXiv:2503.13551v3 Announce Type: replace-cross Abstract: Recent studies show that Large Language Models (LLMs) achieve strong reasoning capabilities through supervised fine-tuning or reinforcement learning. However, a key approach, the Process Reward Model (PRM), suffers from reward hacking, making it unreliable in identifying the best intermediate step. In addition, the cost of annotating reasoning processes for reward modeling is high, making large-scale collection of high-quality data challenging. To address this, we propose a novel reward model approach called the Hierarchical Reward Model (HRM), which evaluates both individual and consecutive reasoning steps at both fine-grained and coarse-grained levels. HRM excels at assessing multi-step reasoning coherence, especially when flawed steps are later corrected through self-reflection. To further reduce the cost of generating training data, we introduce a lightweight and effective data augmentation strategy called Hierarchical Node Compression (HNC), which merges two consecutive reasoning steps into one within the tree structure. By applying HNC to MCTS-generated reasoning trajectories, we enhance the diversity and robustness of HRM training data while introducing controlled noise with minimal computational overhead. Empirical results on the PRM800K dataset show that HRM, together with HNC, provides more stable and reliable evaluations than PRM. Furthermore, cross-domain evaluations on the MATH500 and GSM8K datasets demonstrate HRM's strong generalization and robustness across a variety of reasoning tasks.

摘要

近期研究表明,大型语言模型(LLMs)通过监督微调或强化学习获得了强大的推理能力。然而,过程奖励模型(PRM)这一关键方法存在奖励破解问题,导致其识别最佳中间步骤的可靠性不足。此外,为奖励建模标注推理过程的成本高昂,使得大规模收集高质量数据具有挑战性。为解决这些问题,我们提出了一种名为分层奖励模型(HRM)的新型奖励建模方法,该方法在细粒度和粗粒度两个层面上对单个及连续推理步骤进行评估。HRM尤其擅长评估多步推理的连贯性,特别是在错误步骤通过自我反思得到修正的情况下。为进一步降低训练数据生成成本,我们提出了一种轻量级高效的数据增强策略——分层节点压缩(HNC),该策略在树状结构中合并两个连续推理步骤。通过将HNC应用于蒙特卡洛树搜索生成的推理轨迹,我们在引入可控噪声的同时,以最小计算开销提升了HRM训练数据的多样性和鲁棒性。PRM800K数据集的实证结果表明,HRM与HNC相结合能提供比PRM更稳定可靠的评估结果。此外,在MATH500和GSM8K数据集上的跨领域评估表明,HRM在各类推理任务中均展现出强大的泛化能力和鲁棒性。


HAIR: Hardness-Aware Inverse Reinforcement Learning with Introspective Reasoning for LLM Alignment

Abstract

arXiv:2503.18991v2 Announce Type: replace-cross Abstract: The alignment of large language models (LLMs) with human values remains critical yet hindered by four key challenges: (1) scarcity of balanced safety datasets, (2) alignment tax, (3) vulnerability to jailbreak attacks due to shallow alignment, and (4) inability to dynamically adapt rewards according to task difficulty. To address these limitations, we introduce HAIR (Hardness-Aware Inverse Reinforcement Learning with Introspective Reasoning), a novel alignment approach inspired by shadow models in membership inference attacks. Our approach consists of two main components: (1) construction of a balanced safety Chain-of-Draft (CoD) dataset for seven harmful categories using structured prompts that leverage the introspective reasoning capabilities of LLMs; and (2) training of category-specific reward models with Group Relative Policy Optimization (GRPO), dynamically tuning optimization to task difficulty at both the data and model levels. Comprehensive experiments across four harmlessness and four usefulness benchmarks demonstrate that HAIR achieves state-of-the-art performance, outperforming all baseline methods in safety while maintaining high levels of usefulness.

摘要

大型语言模型(LLMs)与人类价值观的对齐仍然至关重要,但面临四个关键挑战:(1)缺乏平衡的安全性数据集;(2)对齐税;(3)由于浅层对齐而易受越狱攻击;(4)无法根据任务难度动态调整奖励。为解决这些限制,我们提出了HAIR(基于难度感知的反向强化学习与自省推理),这是一种受成员推理攻击中影子模型启发的新型对齐方法。我们的方法包含两个主要部分:(1)利用LLMs的自省推理能力,通过结构化提示构建涵盖七种有害类别的平衡安全性Chain-of-Draft(CoD)数据集;(2)使用组相对策略优化(GRPO)训练特定类别的奖励模型,在数据和模型层面动态调整优化以适应任务难度。在四个无害性和四个有用性基准上的综合实验表明,HAIR实现了最先进的性能,在安全性上优于所有基线方法,同时保持了高水平的实用性。


Catch Me if You Search: When Contextual Web Search Results Affect the Detection of Hallucinations

Abstract

arXiv:2504.01153v3 Announce Type: replace-cross Abstract: While we increasingly rely on large language models (LLMs) for various tasks, these models are known to produce inaccurate content or `hallucinations' with potentially disastrous consequences. The recent integration of web search results into LLMs prompts the question of whether people utilize them to verify the generated content, thereby accurately detecting hallucinations. An online experiment (N = 560) investigated how the provision of search results, either static (i.e., fixed search results provided by LLM) or dynamic (i.e., participant-led searches), affects participants' perceived accuracy of LLM-generated content (i.e., genuine, minor hallucination, major hallucination), self-confidence in accuracy ratings, as well as their overall evaluation of the LLM, as compared to the control condition (i.e., no search results). Results showed that participants in both static and dynamic conditions (vs. control) rated hallucinated content to be less accurate and perceived the LLM more negatively. However, those in the dynamic condition rated genuine content as more accurate and demonstrated greater overall self-confidence in their assessments than those in the static search or control conditions. We highlighted practical implications of incorporating web search functionality into LLMs in real-world contexts.

摘要

随着我们日益依赖大语言模型(LLMs)执行各类任务,这些模型会产生不准确内容或"幻觉"的问题已广为人知,并可能引发灾难性后果。近期将网络搜索结果整合至LLMs的做法,促使我们探究人们是否会利用这些结果来验证生成内容,从而准确识别幻觉。一项在线实验(N=560)比较了静态搜索条件(即LLM提供的固定搜索结果)、动态搜索条件(即参与者自主搜索)与控制条件(无搜索结果)下,参与者对LLM生成内容(包括真实内容、轻微幻觉和严重幻觉)准确性感知、判断自信度以及对LLM整体评价的差异。结果表明:相较于控制组,静态和动态搜索条件下的参与者均更倾向于判定幻觉内容不准确,并对LLM持更负面评价。但动态搜索组的参与者不仅对真实内容的准确性评分更高,其判断自信度也显著优于静态搜索组和控制组。我们进一步探讨了在实际应用中为LLMs集成网络搜索功能的实践意义。


SEAL: Steerable Reasoning Calibration of Large Language Models for Free

Abstract

arXiv:2504.07986v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs), such as OpenAI's o1-series have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism. However, recent studies reveal substantial redundancy in the CoT reasoning traces, which not only increases inference latency but also negatively impacts model performance by diverting attention to unnecessary reasoning paths. To address this issue, we investigate the internal reasoning structures of LLMs and categorize them into three primary thought types: execution, reflection, and transition thoughts. Moreover, our analysis reveals that excessive reflection and transition thoughts are strongly correlated with failure cases and these thought categories exhibit clear separation in the latent space. Based on these, we introduce SEAL (Steerable reasoning calibration), a training-free approach that seamlessly calibrates the CoT process, improving accuracy while demonstrating significant efficiency gains. SEAL consists of an offline stage for extracting the reasoning steering vector in the latent space, followed by an on-the-fly calibration of the reasoning trace through representation intervention using the steering vector. Notably, the steering vector exhibits strong transferability across various tasks. Extensive experiments across multiple models (DeepSeek-R1-Distill and QwQ-32B-Preview) and benchmarks (Math500, GSM8K, LiveCodeBench) validate the effectiveness of SEAL, up to a 11% improvement in accuracy while reducing reasoning tokens by 11.8% to 50.4%. Our code is publicly available at https://github.com/VITA-Group/SEAL.

摘要

大型语言模型(LLMs),如OpenAI的o1系列,通过扩展的思维链(CoT)推理机制,在复杂推理任务中展现出卓越能力。然而,近期研究表明CoT推理轨迹存在显著冗余,这不仅增加了推理延迟,还会因注意力分散至不必要的推理路径而降低模型性能。为解决该问题,我们探究了LLMs的内部推理结构,将其归纳为三种核心思维类型:执行思维、反思思维和转换思维。进一步分析表明,过量的反思与转换思维与推理失败案例高度相关,且这些思维类别在潜在空间中呈现明显分离。基于此,我们提出SEAL(可调控推理校准),一种无需训练的方法,可无缝校准CoT过程,在提升准确率的同时显著提高效率。SEAL包含离线阶段(提取潜在空间中的推理导向向量)和在线阶段(利用该向量通过表征干预实时校准推理轨迹)。值得注意的是,该导向向量在不同任务间表现出强迁移性。在多个模型(DeepSeek-R1-Distill和QwQ-32B-Preview)及基准测试(Math500、GSM8K、LiveCodeBench)上的大量实验验证了SEAL的有效性,最高可提升11%的准确率,同时减少11.8%至50.4%的推理标记量。代码已开源:https://github.com/VITA-Group/SEAL。


CCSK:Cognitive Convection of Self-Knowledge Based Retrieval Augmentation for Large Language Models

Abstract

arXiv:2504.10498v3 Announce Type: replace-cross Abstract: The performance of large language models (LLMs) in Q&A task increased substantially through Retrieval-Augmented Generation (RAG) which brings in external knowledge. However, the main difficulty lies in balancing the inherent self-knowledge of LLMs with external information retrieval (IR). The current threshold-based methods apply one-dimensional static mechanisms with single criterion. As a result, their IR decisions might be irrelevant to the LLMs' response under difficult queries. To alleviate this problem, we propose Cognitive Convection of Self-Knowledge (CCSK). Different from traditional methods that maintain single fixed IR activation criteria, CCSK implements a dynamic joint decision process via a Siamese Network module and a Response Quality Model. The Siamese Network calculates the cosine similarity between the current query and the historical queries. The Response Quality Model evaluates the responses of LLMs through LightGBM. The final decision of the CCSK is derived from the outputs of the two modules, as well as text features fused using a multi-head attention mechanism. Extensive experiments on real-world datasets show that CCSK significantly enhances the model's effectiveness in information retrieval.

摘要

大型语言模型(LLMs)在问答任务中的性能通过引入外部知识的检索增强生成(RAG)得到显著提升,但其核心挑战在于平衡模型固有知识与外部信息检索(IR)。现有基于阈值的方法采用单一标准的静态一维机制,导致处理复杂查询时检索决策可能与模型响应无关。为此,我们提出自知识认知对流(CCSK)方法。不同于传统固定检索激活准则的方案,CCSK通过孪生网络模块和响应质量模型实现动态联合决策:孪生网络计算当前查询与历史查询的余弦相似度,响应质量模型基于LightGBM评估模型响应质量,最终决策综合两个模块输出及经多头注意力机制融合的文本特征生成。真实场景数据集实验表明,CCSK能显著提升模型信息检索效能。


Chain-of-Thought Textual Reasoning for Few-shot Temporal Action Localization

Abstract

arXiv:2504.13460v3 Announce Type: replace-cross Abstract: Traditional temporal action localization (TAL) methods rely on large amounts of detailed annotated data, whereas few-shot TAL reduces this dependence by using only a few training samples to identify unseen action categories. However, existing few-shot TAL methods typically focus solely on video-level information, neglecting textual information, which can provide valuable semantic support for the localization task. Therefore, we propose a new few-shot temporal action localization method by Chain-of-Thought textual reasoning to improve localization performance. Specifically, we design a novel few-shot learning framework that leverages textual semantic information to enhance the model's ability to capture action commonalities and variations, which includes a semantic-aware text-visual alignment module designed to align the query and support videos at different levels. Meanwhile, to better express the temporal dependencies and causal relationships between actions at the textual level to assist action localization, we design a Chain of Thought (CoT)-like reasoning method that progressively guides the Vision Language Model (VLM) and Large Language Model (LLM) to generate CoT-like text descriptions for videos. The generated texts can capture more variance of action than visual features. We conduct extensive experiments on the publicly available ActivityNet1.3 and THUMOS14 datasets. We introduce the first dataset named Human-related Anomaly Localization and explore the application of the TAL task in human anomaly detection. The experimental results demonstrate that our proposed method significantly outperforms existing methods in single-instance and multi-instance scenarios. We will release our code, data and benchmark.

摘要

传统时序动作定位(TAL)方法依赖大量精细标注数据,而小样本TAL仅需少量训练样本即可识别未见动作类别,从而降低这种依赖性。然而现有小样本TAL方法通常仅关注视频级信息,忽视了可为定位任务提供有价值语义支持的文本信息。为此,我们提出一种基于思维链文本推理的新型小样本时序动作定位方法以提升定位性能。具体而言,我们设计了一个利用文本语义信息增强模型捕捉动作共性与变化能力的新型小样本学习框架,其中包含专为对齐查询视频和支持视频多层级特征而设计的语义感知文本-视觉对齐模块。同时,为在文本层面更好表达动作间时序依赖与因果关系以辅助动作定位,我们设计了类思维链(CoT)推理方法,逐步引导视觉语言模型(VLM)和大语言模型(LLM)生成视频的类CoT文本描述。相比视觉特征,生成文本能捕捉更丰富的动作变化特征。我们在公开数据集ActivityNet1.3和THUMOS14上进行了大量实验,并首次构建了名为"人类相关异常定位"的数据集,探索了TAL任务在人类异常检测中的应用。实验结果表明,在单实例和多实例场景下,我们提出的方法均显著优于现有方法。我们将公开代码、数据及基准测试结果。


Pushing the boundary on Natural Language Inference

Abstract

arXiv:2504.18376v2 Announce Type: replace-cross Abstract: Natural Language Inference (NLI) is a central task in natural language understanding with applications in fact-checking, question answering, and information retrieval. Despite its importance, current NLI systems heavily rely on supervised learning with datasets that often contain annotation artifacts and biases, limiting generalization and real-world applicability. In this work, we apply a reinforcement learning-based approach using Group Relative Policy Optimization (GRPO) for Chain-of-Thought (CoT) learning in NLI, eliminating the need for labeled rationales and enabling this type of training on more challenging datasets such as ANLI. We fine-tune 7B, 14B, and 32B language models using parameter-efficient techniques (LoRA and QLoRA), demonstrating strong performance across standard and adversarial NLI benchmarks. Our 32B AWQ-quantized model surpasses state-of-the-art results on 7 out of 11 adversarial sets\unicode&lbrace;x2013&rbrace;or on all of them considering our replication\unicode&lbrace;x2013&rbrace;within a 22GB memory footprint, showing that robust reasoning can be retained under aggressive quantization. This work provides a scalable and practical framework for building robust NLI systems without sacrificing inference quality.

摘要

自然语言推理(NLI)是自然语言理解的核心任务,应用于事实核查、问答系统和信息检索等领域。尽管其重要性显著,现有NLI系统严重依赖监督学习,而所使用的数据集常包含标注伪影和偏差,制约了模型的泛化能力和实际应用效果。本研究采用基于强化学习的方法,通过群体相对策略优化(GRPO)实现NLI中的思维链(CoT)学习,无需标注推理依据即可在ANLI等高难度数据集上开展训练。我们使用参数高效技术(LoRA和QLoRA)对70亿、140亿和320亿参数的语言模型进行微调,在标准及对抗性NLI基准测试中均展现出卓越性能。其中320亿参数的AWQ量化模型在11个对抗性数据集中有7个超越现有最优结果(若计入我们的复现实验则全部超越),仅需22GB内存占用,证明激进量化条件下仍可保持稳健的推理能力。本研究为构建无需牺牲推理质量的鲁棒NLI系统提供了可扩展的实用框架。


RepliBench: Evaluating the Autonomous Replication Capabilities of Language Model Agents

Abstract

arXiv:2504.18565v2 Announce Type: replace-cross Abstract: Uncontrollable autonomous replication of language model agents poses a critical safety risk. To better understand this risk, we introduce RepliBench, a suite of evaluations designed to measure autonomous replication capabilities. RepliBench is derived from a decomposition of these capabilities covering four core domains: obtaining resources, exfiltrating model weights, replicating onto compute, and persisting on this compute for long periods. We create 20 novel task families consisting of 86 individual tasks. We benchmark 5 frontier models, and find they do not currently pose a credible threat of self-replication, but succeed on many components and are improving rapidly. Models can deploy instances from cloud compute providers, write self-propagating programs, and exfiltrate model weights under simple security setups, but struggle to pass KYC checks or set up robust and persistent agent deployments. Overall the best model we evaluated (Claude 3.7 Sonnet) has a >50% pass@10 score on 15/20 task families, and a >50% pass@10 score for 9/20 families on the hardest variants. These findings suggest autonomous replication capability could soon emerge with improvements in these remaining areas or with human assistance.

摘要

语言模型代理的不可控自主复制构成重大安全风险。为深入理解该风险,我们推出RepliBench评估套件,用于系统测量自主复制能力。该套件通过能力解构覆盖四个核心领域:资源获取、模型权重窃取、算力平台复制及长期运行维持。我们创建了包含86项具体任务的20个新型任务族,并对5个前沿模型进行基准测试。结果表明当前模型尚不具备可信的自我复制威胁,但在多个组件上表现优异且进步迅速。这些模型能够从云计算提供商部署实例、编写自我传播程序,并在简单安全设置下窃取模型权重,但难以通过KYC验证或建立鲁棒的持久化代理部署。总体而言,表现最佳的模型(Claude 3.7 Sonnet)在15/20任务族上获得>50%的pass@10分数,在最难变体任务中9/20任务族达到同等水平。这些发现表明,随着剩余短板的改进或人类辅助,自主复制能力可能即将出现。


Llama-Nemotron: Efficient Reasoning Models

Abstract

arXiv:2505.00949v2 Announce Type: replace-cross Abstract: We introduce the Llama-Nemotron series of models, an open family of heterogeneous reasoning models that deliver exceptional reasoning capabilities, inference efficiency, and an open license for enterprise use. The family comes in three sizes -- Nano (8B), Super (49B), and Ultra (253B) -- and performs competitively with state-of-the-art reasoning models such as DeepSeek-R1 while offering superior inference throughput and memory efficiency. In this report, we discuss the training procedure for these models, which entails using neural architecture search from Llama 3 models for accelerated inference, knowledge distillation, and continued pretraining, followed by a reasoning-focused post-training stage consisting of two main parts: supervised fine-tuning and large scale reinforcement learning. Llama-Nemotron models are the first open-source models to support a dynamic reasoning toggle, allowing users to switch between standard chat and reasoning modes during inference. To further support open research and facilitate model development, we provide the following resources: 1. We release the Llama-Nemotron reasoning models -- LN-Nano, LN-Super, and LN-Ultra -- under the commercially permissive NVIDIA Open Model License Agreement. 2. We release the complete post-training dataset: Llama-Nemotron-Post-Training-Dataset. 3. We also release our training codebases: NeMo, NeMo-Aligner, and Megatron-LM.

摘要

我们推出Llama-Nemotron系列模型,这是一个开放的异构推理模型家族,具有卓越的推理能力、高效的推理效率以及允许企业使用的开放许可。该系列包含三种规模——Nano(8B)、Super(49B)和Ultra(253B)——其性能可与DeepSeek-R1等最先进的推理模型相媲美,同时提供更优的推理吞吐量和内存效率。本报告讨论了这些模型的训练过程,包括基于Llama 3模型进行神经架构搜索以实现加速推理、知识蒸馏和持续预训练,随后是由监督微调和大规模强化学习两部分组成的专注于推理的后训练阶段。Llama-Nemotron是首个支持动态推理切换的开源模型,允许用户在推理过程中切换标准聊天和推理模式。为支持开放研究并促进模型开发,我们提供以下资源:1. 在商业友好的NVIDIA开放模型许可协议下发布Llama-Nemotron推理模型——LN-Nano、LN-Super和LN-Ultra;2. 公开完整的后训练数据集Llama-Nemotron-Post-Training-Dataset;3. 同步发布训练代码库NeMo、NeMo-Aligner和Megatron-LM。